In [ ]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import sklearn
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
import scipy
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest
rng = np.random.default_rng()
import pip
import os
from tqdm.notebook import tqdm, trange
tqdm.pandas()
import seaborn as sns
sns.set_theme()
sns.set_style("whitegrid")
import matplotlib as mpl
import matplotlib.pyplot as plt
import folium
from folium import plugins

pd.options.display.max_columns = None
pd.options.display.max_rows = None


Investigating Relationship Between Weather and Criminal Offense Trends in New Orleans¶

Hayden Outlaw, Joe Wagner | Tulane CMPS 6790 Data Science Milestone 2 | Fall 2023¶

https://outlawhayden.github.io/weather-crime¶

Project Outline¶

In New Orleans, worse weather is conventionally understood to bring an increase in crime across the city. For example, WDSU reported on higher temperatures leading to higher murder rates, and a wide variety of crime arose in the aftermath of Hurricane Katrina. With violent crime in particular surging in recent years, officials are turning to every possible confounding factor for criminality, and there is strong demand to understand the relationship between incidents of this kind and their causes.

The investigation of the link between weather and criminal activity is almost as old as data-scientific investigation itself. Adolphe Quetelet posited a "Thermic Law of Delinquency" as far back as 1842, and criminologists have probed the relationship consistently ever since. Based on a survey by Corcoran and Zahnow, between 1842 and 2021 roughly 200 studies on the topic were published, predominantly as journal articles in criminology and psychology. From the same survey, 56.9% of the studies examined the weather-crime association in North America, and 42.8% operated at city-wide scales. 41.7% of these studies employ descriptive analysis, with 77.6% using multiple empirical elements followed by a modelling component. Within cities such as Philadelphia, Dallas, and Baltimore, research groups have identified a clear relationship between weather and crime (specifically between temperature and violence), which suggests the informal knowledge has a more rigorous backing that could extrapolate to the city of New Orleans.

A 1990 study by Cohn on the influence of weather and temporal variables on domestic violence outlines four key considerations for weather-crime research: theoretical grounding, operational measures of time and weather, temporal granularity, and statistical techniques. The vast majority of similar studies date from around the 1990s, with only a recent resurgence; given the widespread availability of big-data tools and data sources, it is now much easier to meet her proposed priorities for such a study than it was thirty years ago. Corcoran and Zahnow assert that many studies from this era do not focus successfully enough on these four criteria, and also fail to control for factors such as time of day, live weather at the time of the crime, or bias from imperfect data sources.

No study of this kind exists for New Orleans, despite the city's uniquely strong concerns regarding both crime rates and weather events. The majority of discovered weather-crime connections correlate temperature and violence, and most are found in northern cities with wider seasonal swings that do not apply to New Orleans. Whether criminal activity in New Orleans follows a similar pattern, exhibits undiscovered patterns unique to the area, or shows no pattern at all is yet to be determined. Using public data sources, we aim to investigate the relationship between weather trends and criminal activity from 2011 to the present. We will use New Orleans Police Department calls-for-service records alongside NOAA daily weather reports from stations throughout the city, which we will load, extract, and parse. With these data, a wide variety of questions could be investigated, such as:

  • Does the relationship between higher temperatures and violent crimes extend to New Orleans?
  • Does weather affect criminal activity only during its occurrence, or in the days that follow as well?
  • Do individual weather events have as much of an effect on criminal activity as larger climate or seasonal trends?
  • If a relationship between weather and crime exists, which parts of the city geographically does it affect most? Which portions are most insulated?

Below, we outline our collaboration plan, our data sources, and our initial extraction of some information.

Collaboration Plan¶

To collaborate, we intend to utilize two primary tools. The first is a GitHub repository, which will handle code sharing, version control, organization, and publication. All of our code, tools, and assets are publicly available here: GITHUB REPOSITORY

For live programming collaboration, we intend to use Visual Studio Code Live Share which allows for live simultaneous code editing. We also intend to meet twice a week in person to commit to broader project planning and directional goals.

New Orleans Police Department Calls for Service¶


The first half of the data that we require to investigate this relationship is crime data from New Orleans. Data Driven NOLA hosts publications of all New Orleans Police Department calls for service from 2011 to the present. The data is sanitized of any personal identifiers, but contains location, time, priority, and incident type information. Each year is hosted separately, and cumulatively the dataset is too large for us to host. To download the data and run the notebook, the manual download sources are below:

  • 2011 Calls for Service
  • 2012 Calls for Service
  • 2013 Calls for Service
  • 2014 Calls for Service
  • 2015 Calls for Service
  • 2016 Calls for Service
  • 2017 Calls for Service
  • 2018 Calls for Service
  • 2019 Calls for Service
  • 2020 Calls for Service
  • 2021 Calls for Service
  • 2022 Calls for Service
  • 2023 Calls for Service

To download the data, go to Export -> CSV.

The following script takes all of the .csv files in the folder location data_folder, and stitches them together into one large dataframe to be cached as calls_master.csv and then loaded. To load the data, save all of the exported spreadsheets as .csv files into the location of data_folder, and then run the cell.

In [ ]:
# location of data 
data_folder = '../data/calls_for_service'
# paths for all csv files in data_folder
csv_files = [f for f in os.listdir(data_folder) if f.endswith('csv')]
# if compiled csv file does not already exist
if 'calls_master.csv' not in csv_files:
    # make empty dataframe
    calls_for_service = pd.DataFrame()
    # combine all files in folder into one large dataframe
    for f in tqdm(csv_files, desc = "Combining Files"):
        file_path = os.path.join(data_folder, f)
        df = pd.read_csv(file_path)
        calls_for_service = pd.concat([calls_for_service, df], ignore_index = True)
    # export to combined csv file
    calls_for_service.to_csv('../data/calls_for_service/calls_master.csv')
else:
    # if compiled csv already exists, just load that
    calls_for_service = pd.read_csv(os.path.join(data_folder, 'calls_master.csv'))
In [ ]:
calls_for_service.head()
Out[ ]:
Unnamed: 0 NOPD_Item Type TypeText Priority InitialType InitialTypeText InitialPriority MapX MapY TimeCreate TimeDispatch TimeArrive TimeClosed Disposition DispositionText SelfInitiated Beat BLOCK_ADDRESS Zip PoliceDistrict Location Type_ TimeArrival
0 0 A3472220 22A AREA CHECK 1K 22A AREA CHECK 1K 3688756.0 528696.0 01/28/2020 01:37:20 AM 01/28/2020 01:37:20 AM 01/28/2020 01:37:28 AM 01/28/2020 02:25:50 AM NAT Necessary Action Taken N 4G04 Atlantic Ave & Slidell St 70114.0 4 POINT (-90.04525645 29.94750953) NaN NaN
1 1 A0000220 21 COMPLAINT OTHER 1J 21 COMPLAINT OTHER 1J 3668710.0 533007.0 01/01/2020 12:00:42 AM 01/01/2020 12:00:42 AM 01/01/2020 12:00:42 AM 01/01/2020 01:37:16 AM NAT Necessary Action Taken Y 2U04 034XX Broadway St 70125.0 2 POINT (-90.10840522 29.95996774) NaN NaN
2 2 A2190820 22A AREA CHECK 1K 22A AREA CHECK 1K 3682445.0 530709.0 01/17/2020 09:18:41 PM 01/17/2020 09:18:41 PM 01/17/2020 09:18:47 PM 01/17/2020 09:18:54 PM NAT Necessary Action Taken N 8B02 N Peters St & Bienville St 70130.0 8 POINT (-90.065113 29.95323762) NaN NaN
3 3 A2874820 21 COMPLAINT OTHER 2A 21 COMPLAINT OTHER 1J 3737616.0 590067.0 01/23/2020 10:19:48 AM 01/23/2020 10:22:05 AM 01/23/2020 10:31:11 AM 01/23/2020 10:34:35 AM GOA GONE ON ARRIVAL N 7L08 I-10 E 70129.0 7 POINT (-89.88854843 30.11465463) NaN NaN
4 4 A2029120 34S AGGRAVATED BATTERY BY SHOOTING 2C 34S AGGRAVATED BATTERY BY SHOOTING 2C 3696210.0 551411.0 01/16/2020 05:09:05 PM 01/16/2020 05:09:43 PM 01/16/2020 05:16:07 PM 01/16/2020 10:49:37 PM RTF REPORT TO FOLLOW N 7A01 Chef Menteur Hwy & Downman Rd 70126.0 7 POINT (-90.02090137 30.00973449) NaN NaN

According to Data Driven NOLA, the default attributes are described as:

  • NOPD_Item: The NOPD unique item number for the incident.
  • Type: The NOPD Type associated with the call for service.
  • TypeText: The NOPD TypeText associated with the call for service.
  • Priority: The NOPD Priority associated with the call for service. Code 3 is considered the highest priority and is reserved for officer needs assistance. Code 2 are considered "emergency" calls for service. Code 1 are considered "non-emergency" calls for service. Code 0 calls do not require a police presence. Priorities are differentiated further using the letter designation with "A" being the highest priority within that level.
  • InitialType: The NOPD InitialType associated with the call for service.
  • InitialTypeText: The NOPD InitialTypeText associated with the call for service.
  • InitialPriority: The NOPD InitialPriority associated with the call for service. See Priority description for more information.
  • MapX: The NOPD MapX associated with the call for service. This is provided in state plane and obscured to protect the sensitivity of the data.
  • MapY: The NOPD MapY associated with the call for service. This is provided in state plane and obscured to protect the sensitivity of the data.
  • TimeCreate: The NOPD TimeCreate associated with the call for service. This is the time stamp of the create time of the incident in the CAD system.
  • TimeDispatch: The NOPD TimeDispatch associated with the call for service. This is the entered time by OPCD or NOPD when an officer was dispatched.
  • TimeArrive: The NOPD TimeArrive associated with the call for service. This is the entered time by OPCD or NOPD when an officer arrived.
  • TimeClosed: The NOPD TimeClosed associated with the call for service. This is the time stamp of the time the call was closed in the CAD system.
  • Disposition: The NOPD Disposition associated with the call for service.
  • DispositionText: The NOPD DispositionText associated with the call for service.
  • SelfInitiated: The NOPD SelfInitiated associated with the call for service. A call is considered self-initiated if the Officer generates the item in the field as opposed to responding to a 911 call.
  • Beat: The NOPD Beat associated with the call for service. This is the area within Orleans Parish that the call for service occurred. The first number is the NOPD District, the letter is the zone, and the numbers are the subzone.
  • BLOCK_ADDRESS: The BLOCK unique address number for the incident. The block address has been obscured to protect the sensitivity of the data.
  • Zip: The NOPD Zip associated with the call for service.
  • PoliceDistrict: The NOPD PoliceDistrict associated with the call for service.
  • Location: The NOPD Location associated with the call for service. The X,Y coordinates for the call for service obscured to protect the sensitivity of the data.
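Since the Priority encoding described above is a numeric level followed by a letter sub-priority (e.g. "2C"), the two parts can be split apart programmatically. A minimal sketch on a hypothetical sample of priority strings (not run on the full dataframe here):

```python
import pandas as pd

# Sample priorities in the NOPD level-plus-letter format,
# e.g. "2C" = emergency level 2, sub-priority C ("A" is highest within a level)
priorities = pd.Series(["1K", "1J", "2A", "2C", "3A"])

# Split each code into its numeric level (0-3) and letter sub-priority
parsed = priorities.str.extract(r'^(?P<Level>\d)(?P<SubPriority>[A-Z])$')
parsed["Level"] = parsed["Level"].astype(int)

print(parsed)
```

The same extract could be applied to the Priority column of calls_for_service if level-wise aggregation becomes useful later.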

The dataset is large, with more than 5,000,000 rows.

In [ ]:
# size of calls_for_service
calls_for_service.shape
Out[ ]:
(5109233, 24)

By examining the TypeText attribute, we can see a few of the unique incident types reported in the dataset.

In [ ]:
# list first 50 unique type labels for NOPD incidents
calls_for_service["TypeText"].unique()[:50]
Out[ ]:
array(['AREA CHECK', 'COMPLAINT OTHER', 'AGGRAVATED BATTERY BY SHOOTING',
       'AUTO ACCIDENT', 'RECOVERY OF REPORTED STOLEN VEHICLE',
       'DISTURBANCE (OTHER)', 'SHOPLIFTING', 'BICYCLE THEFT', 'HIT & RUN',
       'TRAFFIC STOP', 'BURGLAR ALARM, SILENT', 'DISCHARGING FIREARM',
       'SIMPLE BURGLARY VEHICLE', 'MEDICAL', 'SUSPICIOUS PERSON',
       'DOMESTIC DISTURBANCE', 'FIREWORKS', 'MENTAL PATIENT',
       'SUICIDE THREAT', 'PROWLER', 'FIGHT', 'THEFT',
       'SIMPLE CRIMINAL DAMAGE', 'EXTORTION (THREATS)', 'THEFT BY FRAUD',
       'SIMPLE BATTERY', 'RESIDENCE BURGLARY', 'HOMICIDE BY SHOOTING',
       'MISSING JUVENILE', 'RETURN FOR ADDITIONAL INFO',
       'UNAUTHORIZED USE OF VEHICLE', 'LOST PROPERTY',
       'VIOLATION OF PROTECTION ORDER', 'PUBLIC GATHERING',
       'AGGRAVATED RAPE', 'UNCLASSIFIED DEATH',
       'AGGRAVATED ASSAULT DOMESTIC', 'AUTO THEFT', 'TRAFFIC INCIDENT',
       'SIMPLE BATTERY DOMESTIC', 'DRUG VIOLATIONS',
       'SIMPLE ASSAULT DOMESTIC', 'THEFT FROM EXTERIOR OF VEHICLE',
       'ILLEGAL EVICTION', 'SIMPLE BURGLARY', 'ARMED ROBBERY WITH KNIFE',
       'ARMED ROBBERY WITH GUN', 'NOISE COMPLAINT',
       'AGGRAVATED BATTERY BY CUTTING', 'AUTO ACCIDENT WITH INJURY'],
      dtype=object)

Cleaning Calls for Service Dataframe¶

Now that the NOPD data is loaded, it has to be cleaned slightly. First, we need to guarantee that the types of data are loaded correctly - let's examine what Pandas loaded for us, and see what needs to be changed.

In [ ]:
# get all data types for calls_for_service
calls_for_service.dtypes
Out[ ]:
Unnamed: 0           int64
NOPD_Item           object
Type                object
TypeText            object
Priority            object
InitialType         object
InitialTypeText     object
InitialPriority     object
MapX               float64
MapY               float64
TimeCreate          object
TimeDispatch        object
TimeArrive          object
TimeClosed          object
Disposition         object
DispositionText     object
SelfInitiated       object
Beat                object
BLOCK_ADDRESS       object
Zip                float64
PoliceDistrict       int64
Location            object
Type_               object
TimeArrival         object
dtype: object

While most of the columns were correctly loaded as objects by default, there are a few we must change. We have to convert the ZIP code to a categorical object (adding two ZIP codes doesn't make sense), and translate the time-related attributes to Pandas datetime objects.

In [ ]:
# convert ZIP column to object
calls_for_service['Zip'] = calls_for_service['Zip'].astype(str)
In [ ]:
# convert temporal attributes to datetime objects
calls_for_service['TimeCreate'] = pd.to_datetime(calls_for_service['TimeCreate'])
calls_for_service['TimeDispatch'] = pd.to_datetime(calls_for_service['TimeDispatch'])
calls_for_service['TimeArrive'] = pd.to_datetime(calls_for_service['TimeArrive'])
calls_for_service['TimeClosed'] = pd.to_datetime(calls_for_service['TimeClosed'])
In [ ]:
# drop junk index generated during reading
calls_for_service.drop(['Unnamed: 0'], axis =1, inplace = True)

Location Extraction¶

Here, we must extract the proper locations of each service call. This will allow us to match them to the proper weather station in the NOAA dataframe.

In [ ]:
calls_for_service[["Longitude", "Latitude"]] = calls_for_service["Location"].str.extract(r'POINT \((-?\d+\.\d+) (-?\d+\.\d+)\)')
In [ ]:
calls_for_service["Longitude"] = calls_for_service["Longitude"].astype(float)
calls_for_service["Latitude"] = calls_for_service["Latitude"].astype(float)
In [ ]:
calls_for_service.head()
Out[ ]:
NOPD_Item Type TypeText Priority InitialType InitialTypeText InitialPriority MapX MapY TimeCreate TimeDispatch TimeArrive TimeClosed Disposition DispositionText SelfInitiated Beat BLOCK_ADDRESS Zip PoliceDistrict Location Type_ TimeArrival Longitude Latitude
0 A3472220 22A AREA CHECK 1K 22A AREA CHECK 1K 3688756.0 528696.0 2020-01-28 01:37:20 2020-01-28 01:37:20 2020-01-28 01:37:28 2020-01-28 02:25:50 NAT Necessary Action Taken N 4G04 Atlantic Ave & Slidell St 70114.0 4 POINT (-90.04525645 29.94750953) NaN NaN -90.045256 29.947510
1 A0000220 21 COMPLAINT OTHER 1J 21 COMPLAINT OTHER 1J 3668710.0 533007.0 2020-01-01 00:00:42 2020-01-01 00:00:42 2020-01-01 00:00:42 2020-01-01 01:37:16 NAT Necessary Action Taken Y 2U04 034XX Broadway St 70125.0 2 POINT (-90.10840522 29.95996774) NaN NaN -90.108405 29.959968
2 A2190820 22A AREA CHECK 1K 22A AREA CHECK 1K 3682445.0 530709.0 2020-01-17 21:18:41 2020-01-17 21:18:41 2020-01-17 21:18:47 2020-01-17 21:18:54 NAT Necessary Action Taken N 8B02 N Peters St & Bienville St 70130.0 8 POINT (-90.065113 29.95323762) NaN NaN -90.065113 29.953238
3 A2874820 21 COMPLAINT OTHER 2A 21 COMPLAINT OTHER 1J 3737616.0 590067.0 2020-01-23 10:19:48 2020-01-23 10:22:05 2020-01-23 10:31:11 2020-01-23 10:34:35 GOA GONE ON ARRIVAL N 7L08 I-10 E 70129.0 7 POINT (-89.88854843 30.11465463) NaN NaN -89.888548 30.114655
4 A2029120 34S AGGRAVATED BATTERY BY SHOOTING 2C 34S AGGRAVATED BATTERY BY SHOOTING 2C 3696210.0 551411.0 2020-01-16 17:09:05 2020-01-16 17:09:43 2020-01-16 17:16:07 2020-01-16 22:49:37 RTF REPORT TO FOLLOW N 7A01 Chef Menteur Hwy & Downman Rd 70126.0 7 POINT (-90.02090137 30.00973449) NaN NaN -90.020901 30.009734

Getting Rid of Duplicates and False Calls¶

Obviously not every call to the police turns up a crime or results in any action, so it is important we do our best to drop all the unrepresentative calls. Let's take a look at the DispositionText column, which tells us the result of each call, and the Disposition column, which is the abbreviated version of the former.

In [ ]:
calls_for_service["DispositionText"].value_counts()
Out[ ]:
DispositionText
Necessary Action Taken                2511179
REPORT TO FOLLOW                       979928
GONE ON ARRIVAL                        549778
NECESSARY ACTION TAKEN                 527679
VOID                                   236148
UNFOUNDED                              167989
DUPLICATE                              134592
MUNICIPAL NECESSARY ACTION TAK            422
Test incident                             253
Test Incident                             233
Canceled By Complainant                   226
REFERRED TO EXTERNAL AGENCY               185
RTA Related Incident Disposition          130
TRUANCY NECESSARY ACTION TAKEN            129
FALSE ALARM                               100
UNKNOWN                                    81
Clear                                      46
SUPPLEMENTAL                               28
Sobering Center Transport                  28
CURFEW NECESSARY ACTION TAKEN               6
REPORT TO FOLLOW MUNICIPAL                  4
REPORT TO FOLLOW CURFEW                     4
TEST MOTOROLA                               3
REPORT TO FOLLOW TRUANCY                    2
CREATED ON SYS DOWN/RESEARCH                1
Report written incident UnFounded           1
Name: count, dtype: int64
In [ ]:
calls_for_service["Disposition"].value_counts()
Out[ ]:
Disposition
NAT      3038859
RTF       979928
GOA       549777
VOI       236148
UNF       167989
DUP       134592
NATM         422
EST          253
TST          233
CBC          226
REF          185
TRN          130
NATT         129
FAR          100
NO911         50
-13           43
SBC           28
SUPP          28
FDINF         17
NODIS         12
NATC           6
RTFC           4
RTFM           4
CLR            3
TEST           3
RTFT           2
MD/PM          1
1              1
NAT67          1
NAT18          1
NAT71          1
OFFLN          1
RUF            1
Name: count, dtype: int64

The two columns represent the same text, but the DispositionText column is more general and will be easier to work with. Let's now remove any duplicate, void, or false alarm calls along with any where the subject was gone on arrival utilizing the DispositionText. While it might seem intuitive to instantly remove "unfounded" calls, they represent a call where a charge was not given, but an incident still took place - as such, we leave them in our dataset.

In [ ]:
disposition_mask = "GONE ON ARRIVAL|VOID|FALSE ALARM|Clear|DUPLICATE|Test incident|Test Incident|Canceled By Complainant"
# keep only calls whose disposition does not match the mask (na=True drops missing dispositions too)
calls_for_service = calls_for_service[~calls_for_service["DispositionText"].str.contains(disposition_mask, na=True)]

The filter was successful, as seen below.

In [ ]:
calls_for_service["DispositionText"].value_counts()
Out[ ]:
DispositionText
Necessary Action Taken                2511179
REPORT TO FOLLOW                       979928
NECESSARY ACTION TAKEN                 527679
UNFOUNDED                              167989
MUNICIPAL NECESSARY ACTION TAK            422
REFERRED TO EXTERNAL AGENCY               185
RTA Related Incident Disposition          130
TRUANCY NECESSARY ACTION TAKEN            129
UNKNOWN                                    81
SUPPLEMENTAL                               28
Sobering Center Transport                  28
CURFEW NECESSARY ACTION TAKEN               6
REPORT TO FOLLOW MUNICIPAL                  4
REPORT TO FOLLOW CURFEW                     4
TEST MOTOROLA                               3
REPORT TO FOLLOW TRUANCY                    2
Report written incident UnFounded           1
CREATED ON SYS DOWN/RESEARCH                1
Name: count, dtype: int64

Categorizing Call Types¶

The NOPD labels activity with a short string. However, there are 430 different labels, some of which are more specific than others. They also contain typos, multiple labels for the same concept, and events that are not of interest.

In [ ]:
calls_for_service["TypeText"].unique().shape[0]
Out[ ]:
430
In [ ]:
calls_for_service["TypeText"][0:10]
Out[ ]:
0                              AREA CHECK
1                         COMPLAINT OTHER
2                              AREA CHECK
4          AGGRAVATED BATTERY BY SHOOTING
5                           AUTO ACCIDENT
6     RECOVERY OF REPORTED STOLEN VEHICLE
10                              HIT & RUN
11                           TRAFFIC STOP
13                             AREA CHECK
15                             AREA CHECK
Name: TypeText, dtype: object

To simplify, we are going to create 20 different bins based on similar studies, and sort each category into one of these bins.

  1. Accidents/Traffic Safety
  2. Alarms
  3. Public Assistance
  4. Mental Health
  5. Complaints/Environment
  6. Domestic Violence
  7. Drugs
  8. Fire
  9. Alcohol
  10. Medical Emergencies
  11. Missing Persons
  12. Officer Needs Help
  13. Not Crime
  14. Other
  15. Property
  16. Sex Offenses
  17. Status
  18. Suspicion
  19. Violent Crime
  20. Warrants
In [ ]:
types = ['Accidents/Traffic Safety', 'Alarms', 'Public Assistance', 'Mental Health', 'Complaints/Environment', 'Domestic Violence',
        'Drugs','Fire','Alcohol','Medical Emergencies','Missing Persons','Officer Needs Help', 'Not Crime', 'Other', 'Property',
        'Sex Offenses', 'Status','Suspicion','Violent Crime','Warrants']

The mappings are contained in an Excel file, where we mapped each crime's TypeText feature to one of the 20 categories above. We can load this file into a dataframe and then use it as a map to create the broader categories.

In [ ]:
file_path = '../data/output_data.xlsx'
mapping_df = pd.read_excel(file_path)

mapping_dict = mapping_df.set_index('TypeText')['Index'].to_dict()
In [ ]:
calls_for_service.head()
Out[ ]:
NOPD_Item Type TypeText Priority InitialType InitialTypeText InitialPriority MapX MapY TimeCreate TimeDispatch TimeArrive TimeClosed Disposition DispositionText SelfInitiated Beat BLOCK_ADDRESS Zip PoliceDistrict Location Type_ TimeArrival Longitude Latitude
0 A3472220 22A AREA CHECK 1K 22A AREA CHECK 1K 3688756.0 528696.0 2020-01-28 01:37:20 2020-01-28 01:37:20 2020-01-28 01:37:28 2020-01-28 02:25:50 NAT Necessary Action Taken N 4G04 Atlantic Ave & Slidell St 70114.0 4 POINT (-90.04525645 29.94750953) NaN NaN -90.045256 29.947510
1 A0000220 21 COMPLAINT OTHER 1J 21 COMPLAINT OTHER 1J 3668710.0 533007.0 2020-01-01 00:00:42 2020-01-01 00:00:42 2020-01-01 00:00:42 2020-01-01 01:37:16 NAT Necessary Action Taken Y 2U04 034XX Broadway St 70125.0 2 POINT (-90.10840522 29.95996774) NaN NaN -90.108405 29.959968
2 A2190820 22A AREA CHECK 1K 22A AREA CHECK 1K 3682445.0 530709.0 2020-01-17 21:18:41 2020-01-17 21:18:41 2020-01-17 21:18:47 2020-01-17 21:18:54 NAT Necessary Action Taken N 8B02 N Peters St & Bienville St 70130.0 8 POINT (-90.065113 29.95323762) NaN NaN -90.065113 29.953238
4 A2029120 34S AGGRAVATED BATTERY BY SHOOTING 2C 34S AGGRAVATED BATTERY BY SHOOTING 2C 3696210.0 551411.0 2020-01-16 17:09:05 2020-01-16 17:09:43 2020-01-16 17:16:07 2020-01-16 22:49:37 RTF REPORT TO FOLLOW N 7A01 Chef Menteur Hwy & Downman Rd 70126.0 7 POINT (-90.02090137 30.00973449) NaN NaN -90.020901 30.009734
5 A3444420 20 AUTO ACCIDENT 1E 20 AUTO ACCIDENT 1E 3666298.0 529693.0 2020-01-27 19:59:59 2020-01-27 20:02:05 2020-01-27 20:14:58 2020-01-27 21:19:56 RTF REPORT TO FOLLOW N 2L04 Broadway St & S Claiborne Ave 70125.0 2 POINT (-90.11613127 29.95092657) NaN NaN -90.116131 29.950927
In [ ]:
calls_for_service["SimpleType"] = calls_for_service['TypeText'].progress_apply(lambda x: types[mapping_dict[x]])

calls_for_service.head()

print(calls_for_service.iloc[352]["SimpleType"])
Status

NOAA Weather Station Data¶


The second data requirement for answering this question is weather data from around New Orleans across time. The National Oceanic and Atmospheric Administration maintains the Climate Data Online service, which allows requests for historical data from federal weather stations across the country. Our target dataset is the Global Historical Climatology Network daily (GHCN-Daily), which includes daily land-surface observations around the world of temperature, precipitation, wind speed, and other attributes. While there is no direct way to immediately download the data from NOAA, they do allow public requests via email, which include a source from which to download the results of the query. Fortunately, since the data is stored in a compact format, the dataset is small enough for us to host publicly for this project, letting us sidestep the request requirement while remaining inside the NOAA terms of service. The data specific to this project are available as part of this project's repository HERE.

To make your own data request, it can be filed HERE.

GHCN-Daily Query

For this project, there are 11 different weather stations with archived data from Jan 1 2011 to the date of the query (Sept 29 2023).

GHCN-Station Maps

Some of the stations are renamings of existing stations, so the weather data comes from eight unique locations. With this information, given a crime location and time, we can match it to the weather data from the geographically nearest station with data at that point, and establish what the weather was when the crime occurred.
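The nearest-station matching described above can be done with a great-circle (haversine) distance over the station coordinates. A minimal sketch with made-up placeholder coordinates; the real latitudes and longitudes come from the LATITUDE/LONGITUDE columns of the NOAA file:

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3956 * 2 * np.arcsin(np.sqrt(a))

# Hypothetical (placeholder) station coordinates keyed by station ID;
# in practice these would be read from the weather dataframe
stations = {
    "US1LAOR0006": (29.962, -90.039),
    "USW00012916": (29.997, -90.278),
}

def nearest_station(lat, lon):
    """Return the station ID closest to a call location."""
    return min(stations, key=lambda s: haversine_miles(lat, lon, *stations[s]))

print(nearest_station(29.9475, -90.0453))
```

A full implementation would also need to restrict the candidate set to stations that actually reported data on the day of the call, since not every station has observations for every date.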

In [ ]:
# read in weather data
weather = pd.read_csv('../data/weather/NCEI_CDO.csv', low_memory = False)

Cleaning Weather Dataframe¶

In [ ]:
weather.head()
Out[ ]:
STATION NAME LATITUDE LONGITUDE ELEVATION DATE AWND AWND_ATTRIBUTES DAPR DAPR_ATTRIBUTES FMTM FMTM_ATTRIBUTES MDPR MDPR_ATTRIBUTES PGTM PGTM_ATTRIBUTES PRCP PRCP_ATTRIBUTES SNOW SNOW_ATTRIBUTES SNWD SNWD_ATTRIBUTES TAVG TAVG_ATTRIBUTES TMAX TMAX_ATTRIBUTES TMIN TMIN_ATTRIBUTES TOBS TOBS_ATTRIBUTES WDF2 WDF2_ATTRIBUTES WDF5 WDF5_ATTRIBUTES WSF2 WSF2_ATTRIBUTES WSF5 WSF5_ATTRIBUTES WT01 WT01_ATTRIBUTES WT02 WT02_ATTRIBUTES WT03 WT03_ATTRIBUTES WT04 WT04_ATTRIBUTES WT05 WT05_ATTRIBUTES WT06 WT06_ATTRIBUTES WT08 WT08_ATTRIBUTES WT10 WT10_ATTRIBUTES WT11 WT11_ATTRIBUTES WT13 WT13_ATTRIBUTES WT14 WT14_ATTRIBUTES WT16 WT16_ATTRIBUTES WT18 WT18_ATTRIBUTES WT21 WT21_ATTRIBUTES
0 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 2015-02-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.03 ,,N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 2015-02-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.04 ,,N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 2015-02-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 T,,N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 2015-02-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.50 ,,N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 2015-02-05 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.59 ,,N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [ ]:
station_list = weather["STATION"].unique()
station_list
Out[ ]:
array(['US1LAOR0006', 'US1LAOR0016', 'USW00012916', 'US1LAOR0003',
       'US1LAOR0014', 'USC00166666', 'US1LAOR0012', 'USW00053917',
       'USW00012930', 'US1LAOR0009', 'US1LAOR0019'], dtype=object)
In [ ]:
weather.columns
Out[ ]:
Index(['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE', 'AWND',
       'AWND_ATTRIBUTES', 'DAPR', 'DAPR_ATTRIBUTES', 'FMTM', 'FMTM_ATTRIBUTES',
       'MDPR', 'MDPR_ATTRIBUTES', 'PGTM', 'PGTM_ATTRIBUTES', 'PRCP',
       'PRCP_ATTRIBUTES', 'SNOW', 'SNOW_ATTRIBUTES', 'SNWD', 'SNWD_ATTRIBUTES',
       'TAVG', 'TAVG_ATTRIBUTES', 'TMAX', 'TMAX_ATTRIBUTES', 'TMIN',
       'TMIN_ATTRIBUTES', 'TOBS', 'TOBS_ATTRIBUTES', 'WDF2', 'WDF2_ATTRIBUTES',
       'WDF5', 'WDF5_ATTRIBUTES', 'WSF2', 'WSF2_ATTRIBUTES', 'WSF5',
       'WSF5_ATTRIBUTES', 'WT01', 'WT01_ATTRIBUTES', 'WT02', 'WT02_ATTRIBUTES',
       'WT03', 'WT03_ATTRIBUTES', 'WT04', 'WT04_ATTRIBUTES', 'WT05',
       'WT05_ATTRIBUTES', 'WT06', 'WT06_ATTRIBUTES', 'WT08', 'WT08_ATTRIBUTES',
       'WT10', 'WT10_ATTRIBUTES', 'WT11', 'WT11_ATTRIBUTES', 'WT13',
       'WT13_ATTRIBUTES', 'WT14', 'WT14_ATTRIBUTES', 'WT16', 'WT16_ATTRIBUTES',
       'WT18', 'WT18_ATTRIBUTES', 'WT21', 'WT21_ATTRIBUTES'],
      dtype='object')
In [ ]:
special_attribute_labels = {"WT01":"Fog","WT02":"Heavy Fog","WT03":"Thunder","WT04":"Ice Pellets", "WT05":"Hail", "WT06":"Rime", 
                            "WT07": "Dust", "WT08":"Smoke", "WT09":"Blowing Snow", "WT10":"Tornado", "WT11":"High Wind", "WT12":"Blowing Spray",
                            "WT13":"Mist", "WT14":"Drizzle", "WT15":"Freezing Drizzle", "WT16":"Rain", "WT17":"Freezing Rain", "WT18":"Snow", "WT19":"Unknown Precipitation",
                           "WT21":"Ground Fog", "WT22":"Ice Fog"}


for col in special_attribute_labels:
    if col in weather.columns:
        weather[col] = weather[col].notnull()
    attribute = col + "_ATTRIBUTES"
    if attribute in weather.columns:
        weather.drop(attribute, inplace = True, axis = 1)

weather.rename(columns = special_attribute_labels, inplace = True)
In [ ]:
weather.iloc[[2716]]
Out[ ]:
STATION NAME LATITUDE LONGITUDE ELEVATION DATE AWND AWND_ATTRIBUTES DAPR DAPR_ATTRIBUTES FMTM FMTM_ATTRIBUTES MDPR MDPR_ATTRIBUTES PGTM PGTM_ATTRIBUTES PRCP PRCP_ATTRIBUTES SNOW SNOW_ATTRIBUTES SNWD SNWD_ATTRIBUTES TAVG TAVG_ATTRIBUTES TMAX TMAX_ATTRIBUTES TMIN TMIN_ATTRIBUTES TOBS TOBS_ATTRIBUTES WDF2 WDF2_ATTRIBUTES WDF5 WDF5_ATTRIBUTES WSF2 WSF2_ATTRIBUTES WSF5 WSF5_ATTRIBUTES Fog Heavy Fog Thunder Ice Pellets Hail Rime Smoke Tornado High Wind Mist Drizzle Rain Snow Ground Fog
2716 USW00012916 NEW ORLEANS AIRPORT, LA US 29.99755 -90.27772 -1.0 2011-01-18 6.93 ,,W NaN NaN 1733.0 ,,X NaN NaN 1732.0 ,,W 0.95 ,,X,2400 NaN NaN NaN NaN NaN NaN 73.0 ,,X 45.0 ,,X NaN NaN 290.0 ,,X 290.0 ,,X 29.1 ,,X 38.9 ,,X True True True False False False True False False False False True False False
In [ ]:
weather["DATE"] = pd.to_datetime(weather["DATE"])
In [ ]:
general_attribute_labels = {"AWND":"AverageDailyWind", "DAPR":"NumDaysPrecipAvg", "FMTM":"FastestWindTime",
                      "MDPR":"MultidayPrecipTotal", "PGTM":"PeakGustTime", "PRCP":"Precipitation", "SNOW":"Snowfall",
                      "SNWD":"MinSoilTemp", "TAVG":"TimeAvgTemp", "TMAX":"TimeMaxTemp", "TMIN":"TimeMinTemp","TOBS":"TempAtObs", "WDF2":"2MinMaxWindDirection",
                      "WDF5":"5MinMaxWindDirection", "WSF2":"2MinMaxWindSpeed", "WSF5":"5MinMaxWindSpeed"}
                 
                      
for c, col in enumerate(general_attribute_labels):
    attribute = col + "_ATTRIBUTES"
    if attribute in weather.columns:
        weather.drop(attribute, inplace = True, axis = 1)
        
weather.rename(columns = general_attribute_labels, inplace = True)
    
    
decapitalize = {"STATION":"Station", "NAME":"Name", "LATITUDE":"Latitude", "LONGITUDE":"Longitude", "ELEVATION":"Elevation", "DATE":"Date"}

weather.rename(columns = decapitalize, inplace = True)

weather.head()
Out[ ]:
Station Name Latitude Longitude Elevation Date AverageDailyWind NumDaysPrecipAvg FastestWindTime MultidayPrecipTotal PeakGustTime Precipitation Snowfall MinSoilTemp TimeAvgTemp TimeMaxTemp TimeMinTemp TempAtObs 2MinMaxWindDirection 5MinMaxWindDirection 2MinMaxWindSpeed 5MinMaxWindSpeed Fog Heavy Fog Thunder Ice Pellets Hail Rime Smoke Tornado High Wind Mist Drizzle Rain Snow Ground Fog
0 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 2015-02-01 NaN NaN NaN NaN NaN 0.03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
1 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 2015-02-02 NaN NaN NaN NaN NaN 0.04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
2 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 2015-02-03 NaN NaN NaN NaN NaN 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
3 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 2015-02-04 NaN NaN NaN NaN NaN 0.50 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
4 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 2015-02-05 NaN NaN NaN NaN NaN 0.59 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
In [ ]:
weather.set_index(["Date", "Station"], inplace = True)
weather.sort_index(ascending = False, inplace = True)

weather.head()
Out[ ]:
Name Latitude Longitude Elevation AverageDailyWind NumDaysPrecipAvg FastestWindTime MultidayPrecipTotal PeakGustTime Precipitation Snowfall MinSoilTemp TimeAvgTemp TimeMaxTemp TimeMinTemp TempAtObs 2MinMaxWindDirection 5MinMaxWindDirection 2MinMaxWindSpeed 5MinMaxWindSpeed Fog Heavy Fog Thunder Ice Pellets Hail Rime Smoke Tornado High Wind Mist Drizzle Rain Snow Ground Fog
Date Station
2023-09-29 USW00012930 NEW ORLEANS AUDUBON, LA US 29.91660 -90.130200 6.1 NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 85.0 73.0 73.0 NaN NaN NaN NaN False False False False False False False False False False False False False False
USW00012916 NEW ORLEANS AIRPORT, LA US 29.99755 -90.277720 -1.0 NaN NaN NaN NaN NaN NaN NaN NaN 78.0 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
US1LAOR0014 NEW ORLEANS 3.8 WSW, LA US 29.93772 -90.131310 2.1 NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
US1LAOR0009 NEW ORLEANS 5.0 N, LA US 30.01515 -90.065586 0.6 NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
2023-09-28 USW00053917 NEW ORLEANS LAKEFRONT AIRPORT, LA US 30.04934 -90.028990 0.9 15.43 NaN NaN NaN 1815.0 0.0 NaN NaN NaN 87.0 77.0 NaN 60.0 90.0 21.9 25.9 False False False False False False False False False False False False False False

EDA¶

Let's take a look at some firework-related 911 calls. Before plotting, we would expect an influx on certain days of the year (New Year's Eve, July 4). We can extract just the type of offense and the time from the master dataframe.

In [ ]:
# create copy of dataframe type and time columns, and select only ones where the type includes FIRE BOMB, EXPLOSION, FIREWORKS, or ILLEGAL FIREWORKS
explosions_df = calls_for_service[["TypeText", "TimeCreate"]].copy()[calls_for_service["TypeText"].str.contains('FIRE BOMB|EXPLOSION|FIREWORKS|ILLEGAL FIREWORKS')]
In [ ]:
# get the number of each of these kind of incidents overall
explosions_df["TypeText"].value_counts()
Out[ ]:
TypeText
FIREWORKS            3290
ILLEGAL FIREWORKS      87
EXPLOSION              14
FIRE BOMB               1
Name: count, dtype: int64

From this data, 3,290 incidents involved fireworks, 87 involved illegal fireworks, 14 involved explosions, and 1 involved a fire bomb. However, we want to see how many of each kind of incident occur on each day and month, independent of the year or the time of day. We can do this by extracting just the month and day of the incident from the datetime object.

In [ ]:
# extract just the month and day from each incident
explosions_df["Date"] = explosions_df["TimeCreate"].dt.strftime('%m-%d')
explosions_df.head()
Out[ ]:
TypeText TimeCreate Date
26 FIREWORKS 2020-01-01 00:00:34 01-01
27 FIREWORKS 2020-01-01 00:01:05 01-01
2502 FIREWORKS 2020-01-01 00:03:46 01-01
2503 FIREWORKS 2020-01-01 00:03:52 01-01
4494 FIREWORKS 2020-04-20 20:22:27 04-20

Since we care about the date, the type of incident, and the quantity of occurrences, let's group the data by date of occurrence and then by type of incident within each date. We can reindex by these two attributes and then examine their relationship to the quantity.

In [ ]:
# reindex over date, and then the kind of incident within
explosions_nested_df = pd.DataFrame(explosions_df.groupby(["Date", "TypeText"])["TypeText"].count())
explosions_nested_df.rename(columns = {"TypeText": "Quantity"}, inplace = True)
explosions_nested_df.head()
Out[ ]:
Quantity
Date TypeText
01-01 FIREWORKS 460
ILLEGAL FIREWORKS 15
01-02 FIREWORKS 48
ILLEGAL FIREWORKS 1
01-03 FIREWORKS 25

Which day had the most incidents?

In [ ]:
print("Maximum Incidents on", explosions_nested_df["Quantity"].idxmax()[0])
print("Maximum Number of Incidents is", explosions_nested_df["Quantity"].max())
Maximum Incidents on 07-04
Maximum Number of Incidents is 726

Unsurprisingly, the most incidents occur on the Fourth of July, which lines up with our expectations. Let's make a plot to examine the frequency of explosive incidents throughout the year, and see if there are other patterns to be found.

In [ ]:
# create stacked bar plot of each kind of explosion-related incident for each day during the year over all of the data
ax = explosions_nested_df.unstack().plot(kind = "bar", stacked = True, figsize = (20,6))
xtick_interval = 30
ax.set_xticks(range(0, 365, xtick_interval));
ax.set_ylabel("Quantity")
ax.set_title("Explosion Related Incidents in New Orleans Throughout the Year from 2011-2023")
ax.legend(["Explosion", "Fire Bomb", "Fireworks", "Illegal Fireworks"]);
No description has been provided for this image

While there is a dramatic spike around the Fourth of July, there is also noticeable additional activity around Christmas and New Year's. Otherwise, explosion-related incidents are relatively rare throughout the rest of the year. While this activity is seasonal and shows a clear temporal pattern, it is explained far better by the incidence of holidays than by climate patterns throughout the year.

Let's now examine incidents that have a direct causal relationship with the weather. In a similar way, we can query the incident reports for any incidents whose label contains the word "FLOOD".

In [ ]:
# get all incidents that mention 'FLOOD'
floods_df = calls_for_service[["TypeText", "TimeCreate"]].copy()[calls_for_service["TypeText"].str.contains('FLOOD')]
floods_df["TypeText"].value_counts()
Out[ ]:
TypeText
FLOOD EVENT                     2898
FLOODED STREET                   121
FLOODED VEHICLE                   35
FLOODED VEHICLE (NOT MOVING)       1
Name: count, dtype: int64

There were 2,898 flood events, 121 flooded streets, and 36 flooded vehicles (one of which was not moving). We can extract the month and day of each event and get the quantity for each day across all years, the same way we did for the explosion data.

In [ ]:
# extract month, day from datetime objects
floods_df["Date"] = floods_df["TimeCreate"].dt.strftime('%m-%d')
floods_df.head()
Out[ ]:
TypeText TimeCreate Date
14553 FLOOD EVENT 2020-05-23 21:26:31 05-23
35718 FLOOD EVENT 2020-05-15 00:45:38 05-15
35725 FLOOD EVENT 2020-05-15 00:49:22 05-15
35744 FLOOD EVENT 2020-05-15 01:08:16 05-15
35747 FLOOD EVENT 2020-05-15 01:12:54 05-15

Again, we count the number of each type of event by each day, and reindex over these attributes.

In [ ]:
# get quantity of each kind of flood-related events for any given date in a year
floods_nested_df = pd.DataFrame(floods_df.groupby(["Date", "TypeText"])["TypeText"].count())
floods_nested_df.rename(columns = {"TypeText": "Quantity"}, inplace = True)
floods_nested_df.head()
Out[ ]:
Quantity
Date TypeText
01-04 FLOOD EVENT 1
01-07 FLOOD EVENT 4
01-10 FLOOD EVENT 5
01-12 FLOOD EVENT 1
01-23 FLOODED STREET 1
In [ ]:
# create stacked bar plot of each kind of flood-related incident for each day during the year over all of the data
ax = floods_nested_df.unstack().plot(kind = "bar", stacked = True, figsize = (20,6))
xtick_interval = 30
ax.set_xticks(range(0, len(floods_nested_df.unstack()), xtick_interval));
ax.set_ylabel("Quantity")
ax.set_title("Flood Related Incidents in New Orleans Throughout the Year from 2011-2023")
ax.legend(["Flood Event", "Flooded Street", "Flooded Vehicle", "Flooded Vehicle (Not Moving)"]);
No description has been provided for this image

Flooding is, unsurprisingly, directly tied to weather events. As rain increases throughout the summer, there are more days with a higher number of flood-related events, even among the days with abnormally high quantities. However, within that broader trend, the days with a much larger quantity of incidents cannot be explained by seasonal changes alone. It is more likely that these spikes are caused by individual weather events, such as storms, sitting on top of the broader seasonal trends.

Data Matching¶

For each crime, given the date it occurred on, we match it to the weather reading from that day at the closest geographical distance. We can do this with a custom function that takes a row of the calls dataframe, extracts the date, latitude, and longitude of the entry, and then gets all weather readings from stations on that day. Of this list, we find the closest station by Euclidean distance and return that station's identifier.

The calls for service dataframe is pretty large, so matching these entities will take a while (>30 minutes), even though we've vectorized our function. To save time, let's only run the match once and then save it externally. We can then check whether the table exists every time we run the cell, and if so, load it and bypass generating the pairings more than once.

In [ ]:
# get rates of null values for all attributes
(weather.isnull().mean() * 100)
Out[ ]:
Name                     0.000000
Latitude                 0.000000
Longitude                0.000000
Elevation                0.000000
AverageDailyWind        63.303377
NumDaysPrecipAvg        98.658376
FastestWindTime         98.705226
MultidayPrecipTotal     98.679671
PeakGustTime            87.720942
Precipitation            1.499212
Snowfall                80.361174
MinSoilTemp             99.872226
TimeAvgTemp             83.670514
TimeMaxTemp             43.017164
TimeMinTemp             43.064015
TempAtObs               81.042634
2MinMaxWindDirection    63.239491
5MinMaxWindDirection    63.367264
2MinMaxWindSpeed        63.239491
5MinMaxWindSpeed        63.367264
Fog                      0.000000
Heavy Fog                0.000000
Thunder                  0.000000
Ice Pellets              0.000000
Hail                     0.000000
Rime                     0.000000
Smoke                    0.000000
Tornado                  0.000000
High Wind                0.000000
Mist                     0.000000
Drizzle                  0.000000
Rain                     0.000000
Snow                     0.000000
Ground Fog               0.000000
dtype: float64
In [ ]:
#entity matching

# warning, can take >30 mins
def match_weather(crime_row):
    # extract date, latitude, and longitude
    c_date = crime_row["DateCreate"]
    c_lat = crime_row["Latitude"]
    c_long = crime_row["Longitude"]
    # try to find weather on that day
    
    try:
        weather_by_day = weather.loc[c_date]
    except KeyError:
        return np.nan
    
    # if weather exists, get closest station identifier
    euc_distances = np.sqrt((weather_by_day['Latitude'] - c_lat) ** 2 + (weather_by_day['Longitude'] - c_long) ** 2)
    closest_station = euc_distances.idxmin()
    
    return(closest_station)                                       
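As an aside, Euclidean distance on raw latitude/longitude treats a degree of longitude as if it were as long as a degree of latitude, when at 30°N it is about 13% shorter. At city scale this rarely changes which station is nearest, but a great-circle (haversine) distance is a more faithful alternative. A minimal sketch, assuming nothing from the notebook (the function name is ours):

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    # convert degrees to radians; accepts scalars or NumPy arrays
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # great-circle distance on a sphere with Earth's mean radius (~3958.8 mi)
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 3958.8 * 2 * np.arcsin(np.sqrt(a))
```

If desired, the distance line in `match_weather` could be swapped for `haversine_miles(weather_by_day['Latitude'], weather_by_day['Longitude'], c_lat, c_long)` with no other changes, since `idxmin` only cares about the ordering of distances.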
In [ ]:
match_table_path = '../data/match_table.csv'

calls_for_service["DateCreate"] = calls_for_service["TimeCreate"].dt.floor('D')

if os.path.exists(match_table_path):
    print("Loading Cached Entity Matching...")
    match_table = pd.read_csv(match_table_path)
    calls_for_service = calls_for_service.merge(match_table, on = "NOPD_Item", how = "outer")
    
else:
    print("Generating Entity Matching...")
    calls_for_service["PairedStation"] = calls_for_service.progress_apply(match_weather, axis = 1)
    # if the file doesn't exist, save the pairings as a CSV
    match_table = calls_for_service[["NOPD_Item", "PairedStation"]]
    match_table.to_csv(match_table_path, index=False)
    print("Dumping Relational Table to %s" %match_table_path)
Loading Cached Entity Matching...
In [ ]:
calls_for_service.head()
Out[ ]:
NOPD_Item Type TypeText Priority InitialType InitialTypeText InitialPriority MapX MapY TimeCreate TimeDispatch TimeArrive TimeClosed Disposition DispositionText SelfInitiated Beat BLOCK_ADDRESS Zip PoliceDistrict Location Type_ TimeArrival Longitude Latitude SimpleType DateCreate PairedStation
0 A3472220 22A AREA CHECK 1K 22A AREA CHECK 1K 3688756.0 528696.0 2020-01-28 01:37:20 2020-01-28 01:37:20 2020-01-28 01:37:28 2020-01-28 02:25:50 NAT Necessary Action Taken N 4G04 Atlantic Ave & Slidell St 70114.0 4 POINT (-90.04525645 29.94750953) NaN NaN -90.045256 29.947510 Status 2020-01-28 US1LAOR0006
1 A0000220 21 COMPLAINT OTHER 1J 21 COMPLAINT OTHER 1J 3668710.0 533007.0 2020-01-01 00:00:42 2020-01-01 00:00:42 2020-01-01 00:00:42 2020-01-01 01:37:16 NAT Necessary Action Taken Y 2U04 034XX Broadway St 70125.0 2 POINT (-90.10840522 29.95996774) NaN NaN -90.108405 29.959968 Complaints/Environment 2020-01-01 USW00012930
2 A2190820 22A AREA CHECK 1K 22A AREA CHECK 1K 3682445.0 530709.0 2020-01-17 21:18:41 2020-01-17 21:18:41 2020-01-17 21:18:47 2020-01-17 21:18:54 NAT Necessary Action Taken N 8B02 N Peters St & Bienville St 70130.0 8 POINT (-90.065113 29.95323762) NaN NaN -90.065113 29.953238 Status 2020-01-17 US1LAOR0009
3 A2029120 34S AGGRAVATED BATTERY BY SHOOTING 2C 34S AGGRAVATED BATTERY BY SHOOTING 2C 3696210.0 551411.0 2020-01-16 17:09:05 2020-01-16 17:09:43 2020-01-16 17:16:07 2020-01-16 22:49:37 RTF REPORT TO FOLLOW N 7A01 Chef Menteur Hwy & Downman Rd 70126.0 7 POINT (-90.02090137 30.00973449) NaN NaN -90.020901 30.009734 Violent Crime 2020-01-16 USW00053917
4 A3444420 20 AUTO ACCIDENT 1E 20 AUTO ACCIDENT 1E 3666298.0 529693.0 2020-01-27 19:59:59 2020-01-27 20:02:05 2020-01-27 20:14:58 2020-01-27 21:19:56 RTF REPORT TO FOLLOW N 2L04 Broadway St & S Claiborne Ave 70125.0 2 POINT (-90.11613127 29.95092657) NaN NaN -90.116131 29.950927 Accidents/Traffic Safety 2020-01-27 US1LAOR0014

Now that we have the matched station information, a (date, station) pair uniquely identifies a row of the weather table. With this, we can finally merge the two dataframes on the combination of these two keys, attaching that day's weather to each call.

In [ ]:
# merge datasets together
calls_weather_master = pd.merge(calls_for_service, weather, left_on = ["DateCreate", "PairedStation"], right_on = ["Date", "Station"])
calls_weather_master.head()
Out[ ]:
NOPD_Item Type TypeText Priority InitialType InitialTypeText InitialPriority MapX MapY TimeCreate TimeDispatch TimeArrive TimeClosed Disposition DispositionText SelfInitiated Beat BLOCK_ADDRESS Zip PoliceDistrict Location Type_ TimeArrival Longitude_x Latitude_x SimpleType DateCreate PairedStation Name Latitude_y Longitude_y Elevation AverageDailyWind NumDaysPrecipAvg FastestWindTime MultidayPrecipTotal PeakGustTime Precipitation Snowfall MinSoilTemp TimeAvgTemp TimeMaxTemp TimeMinTemp TempAtObs 2MinMaxWindDirection 5MinMaxWindDirection 2MinMaxWindSpeed 5MinMaxWindSpeed Fog Heavy Fog Thunder Ice Pellets Hail Rime Smoke Tornado High Wind Mist Drizzle Rain Snow Ground Fog
0 A3472220 22A AREA CHECK 1K 22A AREA CHECK 1K 3688756.0 528696.0 2020-01-28 01:37:20 2020-01-28 01:37:20 2020-01-28 01:37:28 2020-01-28 02:25:50 NAT Necessary Action Taken N 4G04 Atlantic Ave & Slidell St 70114.0 4 POINT (-90.04525645 29.94750953) NaN NaN -90.045256 29.947510 Status 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
1 A3605320 18 TRAFFIC INCIDENT 1J 18 TRAFFIC INCIDENT 1J 3677293.0 536895.0 2020-01-28 23:40:43 2020-01-28 23:40:43 2020-01-28 23:40:43 2020-01-29 00:01:34 NAT Necessary Action Taken Y 1J03 026XX Saint Ann St 70119.0 1 POINT (-90.08116628 29.97040355) NaN NaN -90.081166 29.970404 Accidents/Traffic Safety 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
2 A3557120 58 RETURN FOR ADDITIONAL INFO 0A 58 RETURN FOR ADDITIONAL INFO 1I 3679778.0 526277.0 2020-01-28 16:28:00 2020-01-28 21:33:43 2020-01-28 21:33:48 2020-01-28 23:01:01 NAT Necessary Action Taken N 6E04 012XX Saint Charles Ave 70130.0 6 POINT (-90.07368951 29.94113385) NaN NaN -90.073690 29.941134 Status 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
3 A3600220 58 RETURN FOR ADDITIONAL INFO 1I 58 RETURN FOR ADDITIONAL INFO 1I 3692978.0 529591.0 2020-01-28 22:23:24 2020-01-28 22:23:24 2020-01-28 22:23:24 2020-01-28 22:52:46 NAT Necessary Action Taken Y 4H02 024XX Sanctuary Dr 70114.0 4 POINT (-90.03189416 29.94983828) NaN NaN -90.031894 29.949838 Status 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
4 A3583520 TS TRAFFIC STOP 1J TS TRAFFIC STOP 1J 3705091.0 512746.0 2020-01-28 19:33:37 2020-01-28 19:33:37 2020-01-28 19:33:37 2020-01-28 19:45:08 NAT Necessary Action Taken Y 4D05 057XX Tullis Dr 70131.0 4 POINT (-89.99426931 29.90313678) NaN NaN -89.994269 29.903137 Accidents/Traffic Safety 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False

Analysis¶


Now that we have our data loaded and cleaned, we can use it to investigate some of the following questions and concepts:

  • Effect of Precipitation on Average Quantity of Incident Types per Day
    • How does the presence of precipitation affect the number of each event type per day?
  • Distribution of Violent vs Non-Violent Incidents across Daily Maximum Temperature
    • Do violent crimes tend to happen at days with higher temperatures than nonviolent crimes?
    • Do violent crimes happen with days with more extreme temperatures at different rates across New Orleans?

Effect of Precipitation¶

First, one of the most elementary distinctions we can make is whether the precipitation on a given day was zero or nonzero. How does this affect the average quantity of each type of incident on any given day?

In [ ]:
# segment data by precipitation being 0 or non-0
precip_data = calls_weather_master.loc[calls_weather_master["Precipitation"]>0]
no_precip_data = calls_weather_master.loc[calls_weather_master["Precipitation"] == 0]

# how many of each kind of incident were there on each day in each data set?
precip_data_counts = precip_data.groupby(by = ["DateCreate"])["SimpleType"].value_counts().to_frame()
no_precip_data_counts = no_precip_data.groupby(by = ["DateCreate"])["SimpleType"].value_counts().to_frame()

# in each dataset, on average, how many were of each type on each day?
avg_precip_data_counts = precip_data_counts.groupby(by = ["SimpleType"]).mean()
avg_no_precip_data_counts = no_precip_data_counts.groupby(by = ["SimpleType"]).mean()
 
# merge data back together
total_avg_precip_counts = pd.merge(avg_precip_data_counts, avg_no_precip_data_counts, on = "SimpleType", suffixes = ("_precip", "_noprecip"))
total_avg_precip_counts.columns = ["PrecipPresent", "PrecipNotPresent"]

total_avg_precip_counts.head()
Out[ ]:
PrecipPresent PrecipNotPresent
SimpleType
Accidents/Traffic Safety 49.928854 65.570949
Alarms 26.324538 34.521739
Alcohol 1.398551 1.350427
Complaints/Environment 119.301837 157.387823
Domestic Violence 17.045213 22.742857
In [ ]:
# melt data back out into duplicate rows, with presence of precipitation as indicator variable
melted_precip_counts = pd.melt(total_avg_precip_counts.reset_index(), id_vars = "SimpleType", value_vars = ["PrecipPresent", "PrecipNotPresent"], var_name = "Precip", value_name = "AvgCount")
melted_precip_counts.head()
Out[ ]:
SimpleType Precip AvgCount
0 Accidents/Traffic Safety PrecipPresent 49.928854
1 Alarms PrecipPresent 26.324538
2 Alcohol PrecipPresent 1.398551
3 Complaints/Environment PrecipPresent 119.301837
4 Domestic Violence PrecipPresent 17.045213
In [ ]:
# create barplot, segment by 'precip'
plt.figure(figsize = (20,6))
precip_diffs = sns.barplot(melted_precip_counts, x = "SimpleType", y = "AvgCount", hue = "Precip")
# rotate long xtick labels
for item in precip_diffs.get_xticklabels():
    item.set_rotation(45)
precip_diffs.set(xlabel = "Simple Type", ylabel = "Average Count Per Day", title = "Average Frequency of Event Categories, With and Without Precipitation")

plt.legend(title = "Precipitation Present on Day")
plt.show()
No description has been provided for this image

This plot shows the average number of each kind of incident on days with no recorded precipitation versus days with precipitation. Many categories ("Missing Persons", "Not Crime", "Fire", "Drugs") appear both low in volume and relatively insensitive to the presence of precipitation. However, many of the largest categories ("Property", "Status", "Complaints/Environment", and "Accidents/Traffic Safety") all show demonstrably lower rates on days when precipitation was present than when it was not.

We can go further and run test statistics on these differences of averages. Using a t-test, we can measure how far apart the means are (in units of standard error) between days with and without precipitation, and the confidence with which we can say they differ.

First, let's take our data for each kind of weather, and completely drop the date attribute - we only care about a day where there were 54 property crimes, a day with 97 property crimes, etc., as an individual instance. The types of crime are all rolled together as well, but we will separate them out next.

In [ ]:
# drop date index
precip_data_raw = precip_data_counts.droplevel("DateCreate")
no_precip_data_raw = no_precip_data_counts.droplevel("DateCreate")

precip_data_raw.head(20)
Out[ ]:
count
SimpleType
Property 54
Complaints/Environment 53
Accidents/Traffic Safety 29
Status 25
Fire 22
Alarms 13
Domestic Violence 11
Violent Crime 11
Suspicion 6
Other 2
Drugs 2
Status 265
Complaints/Environment 195
Accidents/Traffic Safety 149
Property 97
Alarms 44
Violent Crime 30
Domestic Violence 25
Suspicion 17
Drugs 4

Now, let's run our t-tests over all of the categories. First, we'll create a dataframe to store the results. Then we will enumerate over all of the categories (found by selecting the unique values in the index of our dataframe), select the counts from days with and without precipitation for each, and conduct a t-test. Note that these are computed non-parametrically: instead of relying on the theoretical t-distribution, the p-value is found numerically by recomputing the statistic over many random permutations of the group labels (a permutation test). While this is generally weaker than the parametric calculation, it relaxes our assumptions about the underlying distributions (namely, that they are normal). Each test gives us a t-statistic (the difference of the sample means in units of standard error) and a p-value (the probability of observing a difference at least this extreme if the averages were actually the same, i.e. if the null hypothesis were true). Then we can store these results back into our dataframe and compare the test statistics and confidence values for each category of incident.

In [ ]:
# create dataframe to store results
t_tests = pd.DataFrame(columns = ["SimpleType", "TVal", "PVal"])
t_tests.set_index(["SimpleType"], inplace = True)

# for each category of incident
for i, kind in enumerate(precip_data_raw.index.unique()):
    # subselect category from each dataset
    subset_precip = precip_data_raw.loc[kind]
    subset_noprecip = no_precip_data_raw.loc[kind]
    # two-sided permutation t-test between the two samples, 50,000 permutations
    ttest = stats.ttest_ind(subset_precip, subset_noprecip, nan_policy = 'omit', alternative = 'two-sided', permutations = 50000)
    # store results in dictionary, append to dataframe
    tval, pval = ttest.statistic[0], ttest.pvalue[0]
    temp_dict = pd.DataFrame({"TVal": tval, "PVal": pval}, index = [kind])
    t_tests = pd.concat([t_tests, temp_dict], ignore_index = False)


display(t_tests)
TVal PVal
Property -13.718642 0.000020
Complaints/Environment -9.611537 0.000020
Accidents/Traffic Safety -9.314315 0.000020
Status -4.030017 0.000080
Fire 0.015970 0.988380
Alarms -8.525947 0.000020
Domestic Violence -10.065440 0.000020
Violent Crime -13.872960 0.000020
Suspicion -9.708032 0.000020
Other -7.481751 0.000020
Drugs -4.156151 0.000040
Alcohol 0.547275 0.597768
Sex Offenses -2.225532 0.026019
Warrants -0.125015 0.923842
Missing Persons -4.199223 0.000060
Not Crime -0.433415 0.675266
Officer Needs Help -4.012472 0.000100
Public Assistance -2.071430 0.040059
Medical Emergencies -1.813930 0.073459
Mental Health -6.251554 0.000020

Now, let's plot each of these values (annotated with the p values), and see what the test statistics tell us.

In [ ]:
plt.figure(figsize = (20,6))

# make barplot
t_hist = sns.barplot(data = t_tests, x = t_tests.index, y = "TVal", color = 'b')

# rotate long xtick labels
for item in t_hist.get_xticklabels():
    item.set_rotation(45)

# annotate bars with PVal
for i, value in enumerate(t_tests['TVal']):
    if value < 0:
        plt.text(i, value -0.01, round(t_tests.iloc[i]["PVal"], 6), ha='center', va='top')
    else:
        plt.text(i, value + 0.01, round(t_tests.iloc[i]["PVal"], 6), ha='center', va='bottom')

t_hist.set(xlabel = "Incident Type", ylabel = "T-Statistic for Difference of Means", 
           title = "T Statistic for Difference of Means in Incidents per Day with Precipitation Present vs. Not Present (with p-values, 50,000 permutations)");
No description has been provided for this image

This chart tells us that there is very likely a significant difference in the average rate of most incident types between days with and without precipitation. For example, "Property" incidents have a t-statistic of about -13.7: days with precipitation have a mean daily count roughly 13.7 standard errors below the no-precipitation mean, with a probability of about 2e-05 of observing a difference this extreme if the underlying averages were actually equal.

There are some interesting results here, namely that violent crime and property crime are observed at lower rates on days when precipitation is present. A few seem counterintuitive: fires are reported at essentially equal rates between the two samples despite the presence of precipitation, and traffic incidents are reported at a significantly lower rate on days with precipitation (t of about -9.3), which goes against the intuition that people drive worse or get into more accidents in the rain.
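The permutation p-values above can be demystified with a small hand-rolled version: pool the daily counts, shuffle which ones are labeled "precipitation" days many times, and ask how often the shuffled difference of means is at least as extreme as the observed one. This is a simplified sketch on synthetic Poisson counts, using the raw mean difference as the statistic rather than the t-statistic that `scipy.stats.ttest_ind` permutes; all names and numbers here are illustrative, not from our data:

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_test(a, b, n_perm=5000):
    """Two-sided permutation test using the difference of means as the statistic."""
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of which counts were "rain" days
        diff = pooled[:len(a)].mean() - pooled[len(a):].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # add-one smoothing so the reported p-value is never exactly zero
    return observed, (hits + 1) / (n_perm + 1)

# synthetic daily incident counts: rain days drawn with a lower mean than dry days
rain_days = rng.poisson(50, size=200).astype(float)
dry_days = rng.poisson(65, size=300).astype(float)
obs_diff, p_value = perm_test(rain_days, dry_days)
```

With means this far apart, essentially no shuffle matches the observed gap, so the p-value bottoms out at 1/(n_perm + 1), which mirrors why so many categories above share the same minimal p-value of 0.00002 at 50,000 permutations.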

This gives us many options for modeling considerations: since the presence of precipitation seems to affect many of these rates, cumulative incident rates might be able to predict precipitation, or precipitation might predict future incident rates along the columns that showed a more drastic change.

Effect of Temperature on the Rates of Violent Crime¶

One of the leading correlations found in prior research is that an increase in temperature generally corresponds to an increase in violent crime. Is this true in our dataset? We can investigate the distribution of violent and non-violent incidents across all temperatures, and see whether violent crimes happen at a higher rate, a lower rate, or the same rate as temperature rises.

Before we continue, there's an interesting wrinkle here. How much of our temperature data is empty?

In [ ]:
# get rate of nans in TimeMaxTemp column
weather["TimeMaxTemp"].isnull().sum() * 100 / len(weather["TimeMaxTemp"])
Out[ ]:
43.017164274458025

Around 43%. There's an underlying mechanism for this missing data though. Let's look at the count of temperature observations when grouped by station to see where the issues are:

In [ ]:
# get count of values by each station
weather.groupby(by = "Station")["TimeMaxTemp"].count()
Out[ ]:
Station
US1LAOR0003       0
US1LAOR0006       0
US1LAOR0009       0
US1LAOR0012       0
US1LAOR0014       0
US1LAOR0016       0
US1LAOR0019       0
USC00166666       0
USW00012916    4654
USW00012930    4126
USW00053917    4599
Name: TimeMaxTemp, dtype: int64

The missing value rate is because a large number of stations do not record temperature. Let's select one of the stations with substantial temperature data, assume the maximum and minimum temperatures for a day are homogeneous throughout the city, and use the data from its readings.

In [ ]:
# create dataframe for temperature data
temp_df = calls_for_service[["NOPD_Item", "Longitude", "Latitude", "DateCreate", "SimpleType"]].copy().dropna(how = 'any')
# select temperature data from station "USW00012930"
temp_subselect = weather.loc[weather.index.get_level_values(1) == "USW00012930"][["TimeAvgTemp", "TimeMaxTemp", "TimeMinTemp"]]

# merge dataframes
temp_df = pd.merge(temp_df, temp_subselect, left_on = "DateCreate", right_on = "Date", how = "outer").progress_apply(lambda x: x)
  0%|          | 0/8 [00:00<?, ?it/s]

For each incident, we now have readings for the maximum, average, and minimum temperature on that day.

In [ ]:
temp_df.head()
Out[ ]:
NOPD_Item Longitude Latitude DateCreate SimpleType TimeAvgTemp TimeMaxTemp TimeMinTemp
0 A3472220 -90.045256 29.947510 2020-01-28 Status NaN 56.0 45.0
1 A3553920 -90.108122 29.989703 2020-01-28 Status NaN 56.0 45.0
2 A3539420 -90.120693 29.955857 2020-01-28 Accidents/Traffic Safety NaN 56.0 45.0
3 A3574220 -90.097326 29.977234 2020-01-28 Status NaN 56.0 45.0
4 A3605320 -90.081166 29.970404 2020-01-28 Accidents/Traffic Safety NaN 56.0 45.0

Let's see what the rate is for incidents overall for each daily maximum temperature by creating a bar plot of the number of crimes for each daily maximum temperature. Seaborn will standardize our data for us, by setting stat = 'density', so that we get a distribution of the rates across all possible values.

In [ ]:
# plot density of all incidents given each temperature value
plt.figure(figsize = (20,6))
temp_hist = sns.histplot(data = temp_df, x = "TimeMaxTemp", stat = "density")
No description has been provided for this image

This follows a reasonable distribution, similar to the one that temperatures in New Orleans themselves intuitively follow.

Let's now add an indicator for violent incidents, and see if these distributions vary by any amount.

In [ ]:
# create indicator IsViolent, turn into binary classification
temp_df["IsViolent"] = temp_df["SimpleType"] == "Violent Crime"
temp_df.head(10)
Out[ ]:
NOPD_Item Longitude Latitude DateCreate SimpleType TimeAvgTemp TimeMaxTemp TimeMinTemp IsViolent
0 A3472220 -90.045256 29.947510 2020-01-28 Status NaN 56.0 45.0 False
1 A3553920 -90.108122 29.989703 2020-01-28 Status NaN 56.0 45.0 False
2 A3539420 -90.120693 29.955857 2020-01-28 Accidents/Traffic Safety NaN 56.0 45.0 False
3 A3574220 -90.097326 29.977234 2020-01-28 Status NaN 56.0 45.0 False
4 A3605320 -90.081166 29.970404 2020-01-28 Accidents/Traffic Safety NaN 56.0 45.0 False
5 A3543420 -90.112553 29.931343 2020-01-28 Alarms NaN 56.0 45.0 False
6 A3532020 -90.108405 29.959968 2020-01-28 Complaints/Environment NaN 56.0 45.0 False
7 A3486920 -90.099057 29.980953 2020-01-28 Status NaN 56.0 45.0 False
8 A3557120 -90.073690 29.941134 2020-01-28 Status NaN 56.0 45.0 False
9 A3600220 -90.031894 29.949838 2020-01-28 Status NaN 56.0 45.0 False

There is a lot going on under the hood of these plots. We set stat = 'density' as above, but since violent and nonviolent incidents are counted separately and we want to compare their relative rates, we need to normalize them separately. Thankfully, Seaborn will also do this for us, with the common_norm = False parameter. We set the hue of each distribution to the "IsViolent" column; the 'multiple' and 'element' parameters layer the distributions on top of each other and make them translucent so that we can more easily see the differences between the two.

In [ ]:
# make plots of violent and nonviolent distributions based on max day temperature, normalized separately so that they are comparable
plt.figure(figsize = (20,6))
viol_temp_hist = sns.histplot(data = temp_df, x = "TimeMaxTemp", hue = "IsViolent", 
                              stat = "density", multiple = "layer", common_norm = False, element = "step")
viol_temp_hist.set(xlabel = "Maximum Temperature on Day (Fahrenheit)", ylabel = "Proportion of Incidents", title = "Relative Proportions of Violent and Non-Violent Incidents Across Maximum Temperatures on Day");
No description has been provided for this image

From this plot, we can see that violent crimes do indeed happen at higher rates at higher daily maximum temperatures than nonviolent crimes (represented by the orange bars towards the right of the distribution being visible over the tops of the blue ones). However, what happens if we examine the daily minimum temperature?

In [ ]:
# make plots of violent and nonviolent distributions based on min day temperature, normalized separately so that they are comparable
plt.figure(figsize = (20,6))
viol_temp_hist = sns.histplot(data = temp_df, x = "TimeMinTemp", hue = "IsViolent", 
                              stat = "density", multiple = "layer", common_norm = False, element = "step")
viol_temp_hist.set(xlabel = "Minimum Temperature on Day (Fahrenheit)", ylabel = "Proportion of Incidents", title = "Relative Proportions of Violent and Non-Violent Incidents Across Minimum Temperatures on Day");
No description has been provided for this image

The axes for these two plots differ: the spread of daily maximum temperatures is wider than the spread of daily minimum temperatures overall. However, we can see that violent crime happens slightly more often when the daily minimum temperature is lower. Combined with the previous plot, this seems to indicate that violent crime increases overall as the day's temperature extremes widen in either direction.
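The visual comparison could be backed with numbers by splitting the temperature column on the violence indicator and comparing summary statistics (or running a two-sample t-test). A minimal sketch with synthetic stand-in data; on the real data, `temp_demo` would simply be `temp_df`:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# hypothetical stand-in for temp_df: a min-temperature column and a violence flag
temp_demo = pd.DataFrame({
    "TimeMinTemp": rng.normal(loc=66, scale=12, size=1000),
    "IsViolent": rng.random(1000) < 0.05,
})

# summary statistics of the minimum temperature, split by the violence flag
summary = temp_demo.groupby("IsViolent")["TimeMinTemp"].agg(["mean", "median", "std"])

# Welch two-sample t-test on the same split
t, p = stats.ttest_ind(
    temp_demo.loc[temp_demo["IsViolent"], "TimeMinTemp"],
    temp_demo.loc[~temp_demo["IsViolent"], "TimeMinTemp"],
    equal_var=False,
)
```

On the synthetic data the labels are random, so no difference is expected; on the real data, a lower mean `TimeMinTemp` in the violent group would corroborate the histogram.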

Let's go back to the maximum temperature data, and directly address the claim that violent crimes tend to happen on days with higher temperatures (> 90F) than nonviolent crimes.

Next, we can do a statistical test for the difference of proportions of violent crimes with respect to whether or not the daily maximum temperature was at least 90F.

In [ ]:
# create indicator IsHot for if TimeMaxTemp was geq 90 degrees, turn into binary classification
temp_df["IsHot"] = temp_df["TimeMaxTemp"] >= 90.0
temp_df.head()
Out[ ]:
NOPD_Item Longitude Latitude DateCreate SimpleType TimeAvgTemp TimeMaxTemp TimeMinTemp IsViolent IsHot
0 A3472220 -90.045256 29.947510 2020-01-28 Status NaN 56.0 45.0 False False
1 A3553920 -90.108122 29.989703 2020-01-28 Status NaN 56.0 45.0 False False
2 A3539420 -90.120693 29.955857 2020-01-28 Accidents/Traffic Safety NaN 56.0 45.0 False False
3 A3574220 -90.097326 29.977234 2020-01-28 Status NaN 56.0 45.0 False False
4 A3605320 -90.081166 29.970404 2020-01-28 Accidents/Traffic Safety NaN 56.0 45.0 False False
In [ ]:
# get number of incidents for each combination of IsHot and IsViolent
temp_df_hot_gb = temp_df.groupby(by = ["IsHot", "IsViolent"])["NOPD_Item"].count()
temp_df_hot_gb
Out[ ]:
IsHot  IsViolent
False  False        776335
       True          29198
True   False        257354
       True          10537
Name: NOPD_Item, dtype: int64

Now, let's take the number of violent crimes, and the number of total crimes in each classification and run a difference of proportions test. This will give us a Z-value, and a P-Value that we can use to assess the magnitude and confidence of the difference. Thankfully, instead of having to do it by hand, there is a statsmodels proportions_ztest function we can load and use.

In [ ]:
# get counts of violent crimes in each classification
hot_c = temp_df_hot_gb[True, True]
not_hot_c = temp_df_hot_gb[False, True]

# get sample size of each classification
hot_n = temp_df_hot_gb[True, True] + temp_df_hot_gb[True, False]
not_hot_n = temp_df_hot_gb[False, True] + temp_df_hot_gb[False, False]

n_array = np.array([hot_n, not_hot_n])
c_array = np.array([hot_c, not_hot_c])

# perform difference of proportions test.
zval, pval = proportions_ztest(c_array, n_array, alternative = 'larger')

print("P-Value:", pval)
P-Value: 1.1556238891735672e-13

While the difference already looked visually acute in our graphs, the p-value for the test is well below 0.05 (helped by the large sample size), so we can confidently say that the proportion of violent incidents is greater on days where the maximum temperature is at least 90 degrees Fahrenheit than on days where it is lower. Note critically that while this test lets us claim that a difference exists, it tells us nothing about the magnitude or practical significance of that difference.
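For reference, the pooled two-proportion z statistic that `proportions_ztest` computes can be reproduced by hand from the counts in the table above, with scipy's normal distribution supplying the one-sided p-value:

```python
import numpy as np
from scipy import stats

# counts from the IsHot/IsViolent groupby above
hot_c, hot_n = 10537, 10537 + 257354          # violent count, total on hot days
not_hot_c, not_hot_n = 29198, 29198 + 776335  # violent count, total on other days

p_hot = hot_c / hot_n
p_not = not_hot_c / not_hot_n

# pooled proportion and standard error under H0: the two proportions are equal
p_pool = (hot_c + not_hot_c) / (hot_n + not_hot_n)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / hot_n + 1 / not_hot_n))

z = (p_hot - p_not) / se
# one-sided p-value for the 'larger' alternative
p_value = stats.norm.sf(z)
```

The resulting z is about 7.3 and the p-value matches the statsmodels output above to within floating-point noise.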

Locational Distribution of Violent Crimes on days with Extreme Maximum or Minimum Daily Temperature¶

If an increase in temperature does correspond to a greater prevalence of violent incidents, where do they happen geographically? Are certain areas of the city more or less vulnerable to violence based on extreme temperatures?

In [ ]:
# subselect violent incidents
violent_incidents = temp_df.loc[temp_df["SimpleType"] == 'Violent Crime']
violent_incidents.head()
Out[ ]:
NOPD_Item Longitude Latitude DateCreate SimpleType TimeAvgTemp TimeMaxTemp TimeMinTemp IsViolent IsHot
24 A3548720 -90.106546 29.969927 2020-01-28 Violent Crime NaN 56.0 45.0 True False
50 A3571420 -90.098663 29.973679 2020-01-28 Violent Crime NaN 56.0 45.0 True False
118 A3500320 -90.081004 29.938222 2020-01-28 Violent Crime NaN 56.0 45.0 True False
120 A3587520 -90.050751 30.001026 2020-01-28 Violent Crime NaN 56.0 45.0 True False
188 A3481720 -90.050159 30.002987 2020-01-28 Violent Crime NaN 56.0 45.0 True False

From earlier, we notice that only three of the stations actually take temperature readings. Let's take the mean maximum temperature recorded at each one, and use them to inform our definition of what 'extreme' temperature means.

In [ ]:
weather.groupby(by = "Station")["TimeMaxTemp"].mean()
Out[ ]:
Station
US1LAOR0003          NaN
US1LAOR0006          NaN
US1LAOR0009          NaN
US1LAOR0012          NaN
US1LAOR0014          NaN
US1LAOR0016          NaN
US1LAOR0019          NaN
USC00166666          NaN
USW00012916    79.902879
USW00012930    80.475036
USW00053917    78.820613
Name: TimeMaxTemp, dtype: float64

To continue, we need data that lacks empty values. Let's subselect our data again for where the temperature recorded must have some value, and get our overall average temperature for the entire dataset.

In [ ]:
violent_incidents = violent_incidents[violent_incidents["TimeMaxTemp"].isnull()==False]
violent_incidents["TimeMaxTemp"].mean()
Out[ ]:
80.46237456080483

We can have two kinds of extreme temperature: extremely cold and extremely hot. Let's use a simple definition: extremely hot is above the 75th percentile (the fourth quartile), and extremely cold is below the 25th percentile (the first quartile). We can split our data into two groups and plot their locations across New Orleans.

In [ ]:
max_temp_percentiles = np.percentile(violent_incidents['TimeMaxTemp'], [25, 75])
print("25th percentile of Maximum Temperature:", max_temp_percentiles[0])
print("75th percentile of Maximum Temperature:", max_temp_percentiles[1])
25th percentile of Maximum Temperature: 73.0
75th percentile of Maximum Temperature: 90.0

We graph the violent incidents below, and examine their locational distribution based on whether the daily maximum temperature was very high or very low. For the visualization, we use the folium library, a simple map plotting library that can create interactive JavaScript elements inline within our notebook.

In [ ]:
# select only temperature and coordinates, drop empty values
violent_incidents_clean = violent_incidents[["TimeMaxTemp", "Longitude", "Latitude"]].dropna(how = 'any')
# create folium map entity
combined_map = folium.Map(location=[30, -90], tiles="Cartodb dark_matter", zoom_start=12.8)

# add markers to map object
for index, row in violent_incidents_clean.iterrows():

    # blue marker: day's max temperature in the lower quartile
    if row['TimeMaxTemp'] < max_temp_percentiles[0]:
        folium.CircleMarker([row['Latitude'], row['Longitude']], color='blue', radius=3, stroke=False,
                            fill=True, fill_opacity=0.5, opacity=1).add_to(combined_map)
    # red marker: day's max temperature in the upper quartile
    elif row['TimeMaxTemp'] > max_temp_percentiles[1]:
        folium.CircleMarker([row['Latitude'], row['Longitude']], color='red', radius=3, stroke=False,
                            fill=True, fill_opacity=0.3, opacity=1).add_to(combined_map)

combined_map
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

This plot shows the occurrence of violent events when the maximum temperature for that day is in the upper quartile (in red) and when it is in the lower quartile (in blue). While there are certainly clusters of points, notably around the French Quarter and downtown, which temperature extreme a point falls in appears uniformly distributed across the city. This suggests that no particular area of the city is more or less sensitive to violence at temperature extremes; that is, violent crimes do not happen disproportionately in certain areas of the city given extremely high or low maximum daily temperatures.
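The visual impression of uniformity could be checked formally with a chi-square test of independence between location and temperature-extreme group. A sketch with synthetic stand-in data (the district column is hypothetical here; for the real test, a district or neighborhood label would need to be joined onto `violent_incidents`):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)

# hypothetical stand-in: each violent incident gets a district and an extreme flag
incidents = pd.DataFrame({
    "PoliceDistrict": rng.integers(1, 9, size=2000),
    "UpperQuartileTemp": rng.random(2000) < 0.5,
})

# contingency table: district vs which temperature extreme the incident fell in
table = pd.crosstab(incidents["PoliceDistrict"], incidents["UpperQuartileTemp"])

# chi-square test of independence: a small p-value would mean the temperature
# extreme an incident falls in depends on where in the city it occurred
chi2, p, dof, expected = stats.chi2_contingency(table)
```

A large p-value here would be consistent with the map's suggestion that the spatial distribution does not differ between the two temperature extremes.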

Modeling¶

Based on our analysis, there are two potential model concepts we could select going forwards:

First, a feature of immediate importance is how long an incident lasts, or the duration from the time the report is created until it is closed. Does the weather at that time affect how long an incident lasts? Given an incident and the weather at that time, can we predict how long an incident will last?

Second, given in our analysis that there was a strong difference in the rates of most incident types across precipitation levels, can we create a model to reconstruct that effect backwards? In other words, given an incident from the calls for service data, how well can we predict the precipitation at that point?

The following section is preliminary, but shows that there is promise for both models, given our current data and understanding.

Predicting Total Incident Time Given Weather and Incident¶

One of the biggest things the weather at the time of a police-reported incident could affect is the actual response time of the NOPD, and the time it takes to resolve the incident. Given an incident and the weather at the time, can we predict the total length of the incident? That is, can we predict the difference between the "TimeClosed" and "TimeCreate" attributes, given only the weather and the things we would know about an incident at its inception?

In [ ]:
calls_weather_master.head()
Out[ ]:
NOPD_Item Type TypeText Priority InitialType InitialTypeText InitialPriority MapX MapY TimeCreate TimeDispatch TimeArrive TimeClosed Disposition DispositionText SelfInitiated Beat BLOCK_ADDRESS Zip PoliceDistrict Location Type_ TimeArrival Longitude_x Latitude_x SimpleType DateCreate PairedStation Name Latitude_y Longitude_y Elevation AverageDailyWind NumDaysPrecipAvg FastestWindTime MultidayPrecipTotal PeakGustTime Precipitation Snowfall MinSoilTemp TimeAvgTemp TimeMaxTemp TimeMinTemp TempAtObs 2MinMaxWindDirection 5MinMaxWindDirection 2MinMaxWindSpeed 5MinMaxWindSpeed Fog Heavy Fog Thunder Ice Pellets Hail Rime Smoke Tornado High Wind Mist Drizzle Rain Snow Ground Fog
0 A3472220 22A AREA CHECK 1K 22A AREA CHECK 1K 3688756.0 528696.0 2020-01-28 01:37:20 2020-01-28 01:37:20 2020-01-28 01:37:28 2020-01-28 02:25:50 NAT Necessary Action Taken N 4G04 Atlantic Ave & Slidell St 70114.0 4 POINT (-90.04525645 29.94750953) NaN NaN -90.045256 29.947510 Status 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
1 A3605320 18 TRAFFIC INCIDENT 1J 18 TRAFFIC INCIDENT 1J 3677293.0 536895.0 2020-01-28 23:40:43 2020-01-28 23:40:43 2020-01-28 23:40:43 2020-01-29 00:01:34 NAT Necessary Action Taken Y 1J03 026XX Saint Ann St 70119.0 1 POINT (-90.08116628 29.97040355) NaN NaN -90.081166 29.970404 Accidents/Traffic Safety 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
2 A3557120 58 RETURN FOR ADDITIONAL INFO 0A 58 RETURN FOR ADDITIONAL INFO 1I 3679778.0 526277.0 2020-01-28 16:28:00 2020-01-28 21:33:43 2020-01-28 21:33:48 2020-01-28 23:01:01 NAT Necessary Action Taken N 6E04 012XX Saint Charles Ave 70130.0 6 POINT (-90.07368951 29.94113385) NaN NaN -90.073690 29.941134 Status 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
3 A3600220 58 RETURN FOR ADDITIONAL INFO 1I 58 RETURN FOR ADDITIONAL INFO 1I 3692978.0 529591.0 2020-01-28 22:23:24 2020-01-28 22:23:24 2020-01-28 22:23:24 2020-01-28 22:52:46 NAT Necessary Action Taken Y 4H02 024XX Sanctuary Dr 70114.0 4 POINT (-90.03189416 29.94983828) NaN NaN -90.031894 29.949838 Status 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
4 A3583520 TS TRAFFIC STOP 1J TS TRAFFIC STOP 1J 3705091.0 512746.0 2020-01-28 19:33:37 2020-01-28 19:33:37 2020-01-28 19:33:37 2020-01-28 19:45:08 NAT Necessary Action Taken Y 4D05 057XX Tullis Dr 70131.0 4 POINT (-89.99426931 29.90313678) NaN NaN -89.994269 29.903137 Accidents/Traffic Safety 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False

Let's select the data we want to actually use to train the model. Text attributes, indices, duplicate variables, and mostly-empty attributes are dropped here. Since we are predicting total incident time, we must not include any attributes that are only filled in after the incident is resolved, specifically the disposition type for how the incident was resolved. We do keep the location of both the incident and the nearest station, and all of the closest available weather data that we matched earlier.

In [ ]:
cwm_ml = calls_weather_master.copy()
cwm_ml.drop(["NOPD_Item", "FastestWindTime", "MinSoilTemp", "Type_", "PeakGustTime", "TimeAvgTemp", "MultidayPrecipTotal", "NumDaysPrecipAvg", "TempAtObs", 
             "AverageDailyWind", "5MinMaxWindSpeed", "5MinMaxWindDirection", "2MinMaxWindDirection", "2MinMaxWindSpeed", "TimeMinTemp", "TimeMaxTemp", "Snowfall", "Disposition", "DispositionText",
            "MapX", "MapY", "Location", "PairedStation", "Type", "TypeText", "InitialType", "InitialTypeText", "Name", "BLOCK_ADDRESS", "DateCreate", "TimeArrival"], axis = 1, inplace = True) 


cwm_ml.head()
Out[ ]:
Priority InitialPriority TimeCreate TimeDispatch TimeArrive TimeClosed SelfInitiated Beat Zip PoliceDistrict Longitude_x Latitude_x SimpleType Latitude_y Longitude_y Elevation Precipitation Fog Heavy Fog Thunder Ice Pellets Hail Rime Smoke Tornado High Wind Mist Drizzle Rain Snow Ground Fog
0 1K 1K 2020-01-28 01:37:20 2020-01-28 01:37:20 2020-01-28 01:37:28 2020-01-28 02:25:50 N 4G04 70114.0 4 -90.045256 29.947510 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False
1 1J 1J 2020-01-28 23:40:43 2020-01-28 23:40:43 2020-01-28 23:40:43 2020-01-29 00:01:34 Y 1J03 70119.0 1 -90.081166 29.970404 Accidents/Traffic Safety 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False
2 0A 1I 2020-01-28 16:28:00 2020-01-28 21:33:43 2020-01-28 21:33:48 2020-01-28 23:01:01 N 6E04 70130.0 6 -90.073690 29.941134 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False
3 1I 1I 2020-01-28 22:23:24 2020-01-28 22:23:24 2020-01-28 22:23:24 2020-01-28 22:52:46 Y 4H02 70114.0 4 -90.031894 29.949838 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False
4 1J 1J 2020-01-28 19:33:37 2020-01-28 19:33:37 2020-01-28 19:33:37 2020-01-28 19:45:08 Y 4D05 70131.0 4 -89.994269 29.903137 Accidents/Traffic Safety 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False

We create our target variable, "IncidentDuration" as the time difference between the time an incident was created and closed.

In [ ]:
cwm_ml["IncidentDuration"] = cwm_ml["TimeClosed"] - cwm_ml["TimeCreate"]
cwm_ml.head()
Out[ ]:
Priority InitialPriority TimeCreate TimeDispatch TimeArrive TimeClosed SelfInitiated Beat Zip PoliceDistrict Longitude_x Latitude_x SimpleType Latitude_y Longitude_y Elevation Precipitation Fog Heavy Fog Thunder Ice Pellets Hail Rime Smoke Tornado High Wind Mist Drizzle Rain Snow Ground Fog IncidentDuration
0 1K 1K 2020-01-28 01:37:20 2020-01-28 01:37:20 2020-01-28 01:37:28 2020-01-28 02:25:50 N 4G04 70114.0 4 -90.045256 29.947510 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 0 days 00:48:30
1 1J 1J 2020-01-28 23:40:43 2020-01-28 23:40:43 2020-01-28 23:40:43 2020-01-29 00:01:34 Y 1J03 70119.0 1 -90.081166 29.970404 Accidents/Traffic Safety 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 0 days 00:20:51
2 0A 1I 2020-01-28 16:28:00 2020-01-28 21:33:43 2020-01-28 21:33:48 2020-01-28 23:01:01 N 6E04 70130.0 6 -90.073690 29.941134 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 0 days 06:33:01
3 1I 1I 2020-01-28 22:23:24 2020-01-28 22:23:24 2020-01-28 22:23:24 2020-01-28 22:52:46 Y 4H02 70114.0 4 -90.031894 29.949838 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 0 days 00:29:22
4 1J 1J 2020-01-28 19:33:37 2020-01-28 19:33:37 2020-01-28 19:33:37 2020-01-28 19:45:08 Y 4D05 70131.0 4 -89.994269 29.903137 Accidents/Traffic Safety 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 0 days 00:11:31

We want to store the time created attribute, but a model is going to expect a numerical value. We can translate it into epoch seconds as a continuous value that our model will accept, and then remove all of the other time-related values that we are either not using or would be proxies for our target attribute.

In [ ]:
cwm_ml.drop(["TimeDispatch", "TimeArrive", "TimeClosed"], axis = 1, inplace = True)
cwm_ml['TimeCreate'] = (cwm_ml['TimeCreate'] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
cwm_ml.head()
Out[ ]:
Priority InitialPriority TimeCreate SelfInitiated Beat Zip PoliceDistrict Longitude_x Latitude_x SimpleType Latitude_y Longitude_y Elevation Precipitation Fog Heavy Fog Thunder Ice Pellets Hail Rime Smoke Tornado High Wind Mist Drizzle Rain Snow Ground Fog IncidentDuration
0 1K 1K 1580175440 N 4G04 70114.0 4 -90.045256 29.947510 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 0 days 00:48:30
1 1J 1J 1580254843 Y 1J03 70119.0 1 -90.081166 29.970404 Accidents/Traffic Safety 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 0 days 00:20:51
2 0A 1I 1580228880 N 6E04 70130.0 6 -90.073690 29.941134 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 0 days 06:33:01
3 1I 1I 1580250204 Y 4H02 70114.0 4 -90.031894 29.949838 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 0 days 00:29:22
4 1J 1J 1580240017 Y 4D05 70131.0 4 -89.994269 29.903137 Accidents/Traffic Safety 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 0 days 00:11:31

We also need to convert our target attribute into a continuous value, but we can do that with a built-in method.

In [ ]:
cwm_ml["IncidentDuration"] = cwm_ml["IncidentDuration"].apply(lambda x: x.total_seconds())
cwm_ml.head()
Out[ ]:
Priority InitialPriority TimeCreate SelfInitiated Beat Zip PoliceDistrict Longitude_x Latitude_x SimpleType Latitude_y Longitude_y Elevation Precipitation Fog Heavy Fog Thunder Ice Pellets Hail Rime Smoke Tornado High Wind Mist Drizzle Rain Snow Ground Fog IncidentDuration
0 1K 1K 1580175440 N 4G04 70114.0 4 -90.045256 29.947510 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 2910.0
1 1J 1J 1580254843 Y 1J03 70119.0 1 -90.081166 29.970404 Accidents/Traffic Safety 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 1251.0
2 0A 1I 1580228880 N 6E04 70130.0 6 -90.073690 29.941134 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 23581.0
3 1I 1I 1580250204 Y 4H02 70114.0 4 -90.031894 29.949838 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 1762.0
4 1J 1J 1580240017 Y 4D05 70131.0 4 -89.994269 29.903137 Accidents/Traffic Safety 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False 691.0

Now, we need to separate our training data from our label, which by convention we generally call X and y respectively.

In [ ]:
y = cwm_ml["IncidentDuration"]
X = cwm_ml.drop(["IncidentDuration"], axis = 1)
In [ ]:
X.head()
Out[ ]:
Priority InitialPriority TimeCreate SelfInitiated Beat Zip PoliceDistrict Longitude_x Latitude_x SimpleType Latitude_y Longitude_y Elevation Precipitation Fog Heavy Fog Thunder Ice Pellets Hail Rime Smoke Tornado High Wind Mist Drizzle Rain Snow Ground Fog
0 1K 1K 1580175440 N 4G04 70114.0 4 -90.045256 29.947510 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False
1 1J 1J 1580254843 Y 1J03 70119.0 1 -90.081166 29.970404 Accidents/Traffic Safety 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False
2 0A 1I 1580228880 N 6E04 70130.0 6 -90.073690 29.941134 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False
3 1I 1I 1580250204 Y 4H02 70114.0 4 -90.031894 29.949838 Status 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False
4 1J 1J 1580240017 Y 4D05 70131.0 4 -89.994269 29.903137 Accidents/Traffic Safety 29.961679 -90.038803 2.4 0.01 False False False False False False False False False False False False False False

Our data has many categorical variables, but machine learning models generally expect a list of binary variables instead of multi-class categorical variables. Thankfully, pandas will process our data for us quickly using the get_dummies function, which replaces each categorical variable with a one-hot encoding acceptable for training a model.

In [ ]:
X = pd.get_dummies(X)
X.head()
Out[ ]:
TimeCreate PoliceDistrict Longitude_x Latitude_x Latitude_y Longitude_y Elevation Precipitation Fog Heavy Fog Thunder Ice Pellets Hail Rime Smoke Tornado High Wind Mist Drizzle Rain Snow Ground Fog Priority_0 Priority_0A Priority_0B Priority_0C Priority_0D Priority_0E Priority_0F Priority_0G Priority_0H Priority_0I Priority_0U Priority_0V Priority_0W Priority_0X Priority_0Y Priority_0Z Priority_1 Priority_1A Priority_1B Priority_1C Priority_1D Priority_1E Priority_1F Priority_1G Priority_1H Priority_1I Priority_1J Priority_1K Priority_1L Priority_1M Priority_1N Priority_1P Priority_1R Priority_1S Priority_1U Priority_1V Priority_1W Priority_1X Priority_1Y Priority_1Z Priority_2 Priority_2A Priority_2B Priority_2C Priority_2D Priority_2E Priority_2F Priority_2G Priority_2H Priority_2J Priority_2P Priority_2Q Priority_3 Priority_3A Priority_3C Priority_4A InitialPriority_0 InitialPriority_0A InitialPriority_0B InitialPriority_0C InitialPriority_0D InitialPriority_0E InitialPriority_0F InitialPriority_0G InitialPriority_0H InitialPriority_0I InitialPriority_0R InitialPriority_0Z InitialPriority_1 InitialPriority_1A InitialPriority_1B InitialPriority_1C InitialPriority_1D InitialPriority_1E InitialPriority_1F InitialPriority_1G InitialPriority_1H InitialPriority_1I InitialPriority_1J InitialPriority_1K InitialPriority_1Y InitialPriority_1Z InitialPriority_2 InitialPriority_2A InitialPriority_2B InitialPriority_2C InitialPriority_2D InitialPriority_2E InitialPriority_2F InitialPriority_2G InitialPriority_2H InitialPriority_2J InitialPriority_2Q InitialPriority_3A InitialPriority_3C InitialPriority_` SelfInitiated_N SelfInitiated_Y Beat_1A01 Beat_1B01 Beat_1C01 Beat_1C02 Beat_1C03 Beat_1C04 Beat_1E01 Beat_1E02 Beat_1E03 Beat_1E04 Beat_1E05 Beat_1F01 Beat_1F02 Beat_1G01 Beat_1G02 Beat_1H01 Beat_1H02 Beat_1I01 Beat_1I02 Beat_1I03 Beat_1I04 Beat_1J01 Beat_1J02 Beat_1J03 Beat_1J04 Beat_1K01 Beat_1K02 Beat_1L01 Beat_1L02 Beat_1L03 Beat_1L04 Beat_1M01 Beat_1M02 Beat_1M03 
… (one-hot column listing continues through the remaining Beat_*, Zip_*, and SimpleType_* columns)
0 1580175440 4 -90.045256 29.947510 29.961679 -90.038803 2.4 0.01 … (indicator values omitted: each row is True in exactly one column per original categorical feature and False elsewhere)
1 1580254843 1 -90.081166 29.970404 29.961679 -90.038803 2.4 0.01 …
2 1580228880 6 -90.073690 29.941134 29.961679 -90.038803 2.4 0.01 …
3 1580250204 4 -90.031894 29.949838 29.961679 -90.038803 2.4 0.01 …
4 1580240017 4 -89.994269 29.903137 29.961679 -90.038803 2.4 0.01 …

Notice that we now have 672 columns: every categorical column with n possible values becomes n binary indicator columns.
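As a quick illustration of this expansion on a toy frame (hypothetical values, not our actual data):

```python
import pandas as pd

# Toy frame: one categorical column with three possible values expands
# into three binary indicator columns, one per category.
toy = pd.DataFrame({"Beat": ["4G04", "1J03", "6E04"]})
expanded = pd.get_dummies(toy)
print(expanded.shape)  # → (3, 3)
```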

In [ ]:
X.shape
Out[ ]:
(1072855, 672)

What model type is best for this data? While preliminary, given that the majority of our features are categorical, a decision tree is a reasonable first choice. Tree-based models can partition on the underlying groups in the categorical data better than a distance-based model like K-nearest-neighbors, since binary indicator variables only ever have a distance of 0 or 1 from each other.
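A small sketch of why distance-based models struggle with one-hot features: every pair of distinct one-hot vectors is the same Euclidean distance apart, so "nearness" carries no information about which categories are similar.

```python
import numpy as np

# One-hot encodings of three distinct categories.
a = np.array([1, 0, 0])
b = np.array([0, 1, 0])
c = np.array([0, 0, 1])

# Every pair of distinct one-hot vectors is exactly sqrt(2) apart, so a
# KNN-style model gains nothing from comparing distances between them.
d_ab = np.linalg.norm(a - b)
d_ac = np.linalg.norm(a - c)
print(d_ab == d_ac)  # → True
```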

We will use the sklearn DecisionTreeRegressor as a quick, lightweight trial to see if a model of this kind shows any promise. We can also use the sklearn cross_val_score function to quickly train and evaluate our model across several cross-validation folds.

In [ ]:
dt_regressor = DecisionTreeRegressor(random_state = 0)
cross_val_score(dt_regressor, X, y, cv = 5, verbose = 3, n_jobs = -1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   48.0s remaining:  1.2min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   51.8s finished
Out[ ]:
array([-0.00145392,  0.0521779 ,  0.04176776,  0.03175554,  0.02556702])

The array this returns holds one score per cross-validation fold. Note that for a regressor, cross_val_score defaults to the coefficient of determination (R²), not mean squared error, so negative values are possible: they indicate the model did worse on that fold than simply predicting the mean. Our scores here are low, meaning the model explains only a small fraction of the variance. However, this is just preliminary - there is no limit on model complexity, no hyperparameter optimization, and none of the other basic machine learning due diligence - but it indicates that this would be an interesting avenue to explore.
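Since the default score is R², an actual error metric has to be requested explicitly via the scoring argument. A minimal sketch on small synthetic stand-in data (our real X has 672 one-hot columns; the names X_demo/y_demo and the max_depth cap are illustration choices, not our pipeline):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data so the sketch runs quickly.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 10)).astype(float)
y_demo = rng.random(200)

# sklearn negates error metrics so "higher is better" holds uniformly,
# hence the "neg_" prefix; flipping the sign recovers the MSE itself.
dt = DecisionTreeRegressor(random_state=0, max_depth=5)
mse = -cross_val_score(dt, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
print(mse.shape)  # one non-negative MSE per fold
```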

Predicting Precipitation from Calls For Service¶

In our manual analysis, we showed that there was a significant difference between rates of events of each kind given the precipitation recorded at the nearest weather station. Instead of trying to predict incidents based on precipitation, can we invert the relationship and use the calls for service data to predict the precipitation recorded on that day?

In [ ]:
calls_weather_master.head()
Out[ ]:
NOPD_Item Type TypeText Priority InitialType InitialTypeText InitialPriority MapX MapY TimeCreate TimeDispatch TimeArrive TimeClosed Disposition DispositionText SelfInitiated Beat BLOCK_ADDRESS Zip PoliceDistrict Location Type_ TimeArrival Longitude_x Latitude_x SimpleType DateCreate PairedStation Name Latitude_y Longitude_y Elevation AverageDailyWind NumDaysPrecipAvg FastestWindTime MultidayPrecipTotal PeakGustTime Precipitation Snowfall MinSoilTemp TimeAvgTemp TimeMaxTemp TimeMinTemp TempAtObs 2MinMaxWindDirection 5MinMaxWindDirection 2MinMaxWindSpeed 5MinMaxWindSpeed Fog Heavy Fog Thunder Ice Pellets Hail Rime Smoke Tornado High Wind Mist Drizzle Rain Snow Ground Fog
0 A3472220 22A AREA CHECK 1K 22A AREA CHECK 1K 3688756.0 528696.0 2020-01-28 01:37:20 2020-01-28 01:37:20 2020-01-28 01:37:28 2020-01-28 02:25:50 NAT Necessary Action Taken N 4G04 Atlantic Ave & Slidell St 70114.0 4 POINT (-90.04525645 29.94750953) NaN NaN -90.045256 29.947510 Status 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
1 A3605320 18 TRAFFIC INCIDENT 1J 18 TRAFFIC INCIDENT 1J 3677293.0 536895.0 2020-01-28 23:40:43 2020-01-28 23:40:43 2020-01-28 23:40:43 2020-01-29 00:01:34 NAT Necessary Action Taken Y 1J03 026XX Saint Ann St 70119.0 1 POINT (-90.08116628 29.97040355) NaN NaN -90.081166 29.970404 Accidents/Traffic Safety 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
2 A3557120 58 RETURN FOR ADDITIONAL INFO 0A 58 RETURN FOR ADDITIONAL INFO 1I 3679778.0 526277.0 2020-01-28 16:28:00 2020-01-28 21:33:43 2020-01-28 21:33:48 2020-01-28 23:01:01 NAT Necessary Action Taken N 6E04 012XX Saint Charles Ave 70130.0 6 POINT (-90.07368951 29.94113385) NaN NaN -90.073690 29.941134 Status 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
3 A3600220 58 RETURN FOR ADDITIONAL INFO 1I 58 RETURN FOR ADDITIONAL INFO 1I 3692978.0 529591.0 2020-01-28 22:23:24 2020-01-28 22:23:24 2020-01-28 22:23:24 2020-01-28 22:52:46 NAT Necessary Action Taken Y 4H02 024XX Sanctuary Dr 70114.0 4 POINT (-90.03189416 29.94983828) NaN NaN -90.031894 29.949838 Status 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False
4 A3583520 TS TRAFFIC STOP 1J TS TRAFFIC STOP 1J 3705091.0 512746.0 2020-01-28 19:33:37 2020-01-28 19:33:37 2020-01-28 19:33:37 2020-01-28 19:45:08 NAT Necessary Action Taken Y 4D05 057XX Tullis Dr 70131.0 4 POINT (-89.99426931 29.90313678) NaN NaN -89.994269 29.903137 Accidents/Traffic Safety 2020-01-28 US1LAOR0006 NEW ORLEANS 2.1 ENE, LA US 29.961679 -90.038803 2.4 NaN NaN NaN NaN NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False False

Let's select the data we want to use to train our model. We only want the calls for service data, and precipitation at the nearest station - we drop all of the empty, text-based, or identifying variables as before, but also most of the other weather data that could serve as a proxy for the precipitation value. This ensures that the model is only using information about the incident to predict precipitation, and not whether or not some other binary weather attribute was recorded.

In [ ]:
cwm_ml_precip = calls_weather_master.copy()
cwm_ml_precip.drop(["NOPD_Item", "FastestWindTime", "MinSoilTemp", "Type_", "PeakGustTime", "TimeAvgTemp", "MultidayPrecipTotal", "NumDaysPrecipAvg", "TempAtObs", 
             "AverageDailyWind", "5MinMaxWindSpeed", "5MinMaxWindDirection", "2MinMaxWindDirection", "2MinMaxWindSpeed", "TimeMinTemp", "TimeMaxTemp", "Snowfall", "Disposition", "DispositionText",
            "MapX", "MapY", "Location", "PairedStation", "Type", "TypeText", "InitialType", "InitialTypeText", "Name", "BLOCK_ADDRESS", "DateCreate", "TimeArrival",
                   "TimeDispatch", "TimeArrive", "Fog", "Heavy Fog", "Thunder", "Ice Pellets", "Hail", "Rime", "Smoke", "Tornado","High Wind", "Mist", "Drizzle", "Rain", "Snow", "Ground Fog"], axis = 1, inplace = True) 


cwm_ml_precip.head()
Out[ ]:
Priority InitialPriority TimeCreate TimeClosed SelfInitiated Beat Zip PoliceDistrict Longitude_x Latitude_x SimpleType Latitude_y Longitude_y Elevation Precipitation
0 1K 1K 2020-01-28 01:37:20 2020-01-28 02:25:50 N 4G04 70114.0 4 -90.045256 29.947510 Status 29.961679 -90.038803 2.4 0.01
1 1J 1J 2020-01-28 23:40:43 2020-01-29 00:01:34 Y 1J03 70119.0 1 -90.081166 29.970404 Accidents/Traffic Safety 29.961679 -90.038803 2.4 0.01
2 0A 1I 2020-01-28 16:28:00 2020-01-28 23:01:01 N 6E04 70130.0 6 -90.073690 29.941134 Status 29.961679 -90.038803 2.4 0.01
3 1I 1I 2020-01-28 22:23:24 2020-01-28 22:52:46 Y 4H02 70114.0 4 -90.031894 29.949838 Status 29.961679 -90.038803 2.4 0.01
4 1J 1J 2020-01-28 19:33:37 2020-01-28 19:45:08 Y 4D05 70131.0 4 -89.994269 29.903137 Accidents/Traffic Safety 29.961679 -90.038803 2.4 0.01

As before, we want to keep our creation and closing time attributes, but converted to epoch seconds.

In [ ]:
cwm_ml_precip['TimeCreate'] = (cwm_ml_precip['TimeCreate'] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
cwm_ml_precip['TimeClosed'] = (cwm_ml_precip['TimeClosed'] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
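A quick sanity check of this conversion, using the first TimeCreate value shown in the table above:

```python
import pandas as pd

# The TimeCreate of the first row above, 2020-01-28 01:37:20, should map
# to 1580175440 seconds since the Unix epoch (1970-01-01).
ts = pd.Series(pd.to_datetime(["2020-01-28 01:37:20"]))
epoch = (ts - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
print(int(epoch.iloc[0]))  # → 1580175440
```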

Finally, since we're predicting precipitation, we need to drop any rows without a target value - that is, keep only rows where the precipitation value is not missing.

In [ ]:
cwm_ml_precip = cwm_ml_precip[cwm_ml_precip["Precipitation"].notna()]

Again as before, we split our data into X and y - our observations and our target feature.

In [ ]:
y = cwm_ml_precip["Precipitation"]
X = cwm_ml_precip.copy().drop(["Precipitation"], axis = 1)
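The filter-then-split pattern looks like this on a minimal frame (the toy column values are hypothetical stand-ins for the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with one missing target value.
df_toy = pd.DataFrame({"Elevation": [2.4, 2.4, 2.4],
                       "Precipitation": [0.01, np.nan, 0.30]})

df_toy = df_toy[df_toy["Precipitation"].notna()]          # keep rows with a target
y_toy = df_toy["Precipitation"]                           # target
X_toy = df_toy.copy().drop(["Precipitation"], axis=1)     # observations
print(len(X_toy), list(X_toy.columns))  # 2 ['Elevation']
```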
In [ ]:
X.head()
Out[ ]:
Priority InitialPriority TimeCreate TimeClosed SelfInitiated Beat Zip PoliceDistrict Longitude_x Latitude_x SimpleType Latitude_y Longitude_y Elevation
0 1K 1K 1580175440 1580178350 N 4G04 70114.0 4 -90.045256 29.947510 Status 29.961679 -90.038803 2.4
1 1J 1J 1580254843 1580256094 Y 1J03 70119.0 1 -90.081166 29.970404 Accidents/Traffic Safety 29.961679 -90.038803 2.4
2 0A 1I 1580228880 1580252461 N 6E04 70130.0 6 -90.073690 29.941134 Status 29.961679 -90.038803 2.4
3 1I 1I 1580250204 1580251966 Y 4H02 70114.0 4 -90.031894 29.949838 Status 29.961679 -90.038803 2.4
4 1J 1J 1580240017 1580240708 Y 4D05 70131.0 4 -89.994269 29.903137 Accidents/Traffic Safety 29.961679 -90.038803 2.4

And we similarly convert our categorical data into a list of binary attributes for input into our model.

In [ ]:
X = pd.get_dummies(X)
X.head()
Out[ ]:
(Output abridged: the encoded frame retains the numeric columns TimeCreate, TimeClosed, PoliceDistrict, Longitude_x, Latitude_x, Latitude_y, Longitude_y, and Elevation, and expands Priority, InitialPriority, SelfInitiated, Beat, Zip, and SimpleType into several hundred boolean indicator columns such as Priority_1K, Beat_4G04, Zip_70114.0, and SimpleType_Status.)
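The effect of get_dummies is easier to see on a tiny frame: numeric columns pass through unchanged, while each category of an object column becomes its own indicator column.

```python
import pandas as pd

# Two rows with one numeric and one categorical column.
df_small = pd.DataFrame({"PoliceDistrict": [4, 1],
                         "SelfInitiated": ["N", "Y"]})

encoded = pd.get_dummies(df_small)
# Numeric columns pass through; each category becomes a boolean indicator.
print(sorted(encoded.columns))  # ['PoliceDistrict', 'SelfInitiated_N', 'SelfInitiated_Y']
```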

Although the underlying data is different, we will still use the DecisionTreeRegressor for the same reasons - the data has much the same shape and makeup, and we are again regressing a continuous value. We also use the cross_val_score function again to see how our model performs over five random splits of training and testing data.

In [ ]:
dt_regressor = DecisionTreeRegressor(random_state = 0)
cross_val_score(dt_regressor, X, y, cv = 5, verbose = 3, n_jobs = -1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   47.0s remaining:  1.2min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   52.1s finished
Out[ ]:
array([-0.13535882, -0.00842982, -0.03389795, -0.00234448, -0.06655831])

Note that the default score for a regressor is R², and all five folds are negative - at best the tree barely matches a baseline that always predicts the mean precipitation. The evaluation also lacks due diligence: multiple incidents occur on the same day, so with no bound on model complexity, reports from the same day can leak across the training/testing split and act as proxies for one another. There is also a large imbalance in the labeled data - around 66% of the precipitation readings are 0, so even a naive classifier of zero versus non-zero precipitation would reach 66% accuracy by always predicting 0. Stratification for the splits should be implemented, along with better preprocessing and more rigorous training and diagnostic metrics.
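One way to close the same-day leakage gap is to group the cross-validation folds by calendar day, so that all reports from a given day land entirely in the training set or entirely in the test set. A minimal sketch on synthetic data (the real run would use the X and y built above, with groups derived from the epoch-second TimeCreate values):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 200

# Synthetic epoch-second creation times spread over ~20 days.
time_create = rng.integers(1_580_000_000, 1_580_000_000 + 20 * 86400, size=n)
X_syn = np.column_stack([time_create, rng.normal(size=n)])
y_syn = rng.random(n)

# Integer-divide by 86400 seconds so each calendar day is one group;
# GroupKFold keeps every group inside a single fold.
groups = time_create // 86400
scores = cross_val_score(DecisionTreeRegressor(random_state=0), X_syn, y_syn,
                         cv=GroupKFold(n_splits=5).split(X_syn, y_syn, groups))
print(len(scores))  # 5
```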

Closing Thoughts and Final Goals¶


Starting from our prior milestone, we have:

  • For the Calls for Service Data:
    • Established simpler incident type bins
    • Finished formatting and cleaning entity literals, extracting location, dates, times, etc.,
  • For the NOAA data:
    • Successfully opened the low-memory file, translating it into meaningful variables and discarding artifacts
  • Performed cacheable entity matching based on date and minimum geographic distance
  • Analyzed:
    • The relationship between the presence of precipitation and the volume of incident types, successfully demonstrating a significant difference across the dataset
    • The distribution of violent vs. nonviolent incidents across daily maximum temperatures, showing a meaningful difference in the distributions above and below 90 degrees Fahrenheit
    • The geographic distribution of violent incidents on days with extremely high or low temperatures, via an interactive map element
  • Modeled:
    • Tentatively, the total time duration of an incident, given joint information about the incident and the weather on that day
    • Tentatively, the precipitation rate on a day, given information about the incidents on that day

For our final submission, we aim to:

  • Flesh out our models with proper preprocessing due diligence such as scaling, stratification, transformations, and diagnostics, as well as creating more advanced models and applying their results
  • Improve the entity matching process to account for the differences in measurement capacity across NOAA weather stations, and recompute analysis figures
  • Potentially incorporate additional data from different incident logs (namely, 311 call data for non-emergencies)

So far, we are pleased with the results we have found. Our analysis lines up in some capacity with our prior hypotheses, namely that incident volume differs with precipitation (people do fewer things that cause incidents when it's raining outside), and that violent incidents occur at slightly higher temperatures than non-violent ones. However, some of our hypotheses have been challenged - violent incidents on days with extreme temperatures are not noticeably disproportionate across geographic areas, and we are surprised at our very tentative ability to model both incident duration and precipitation as functions of our joint dataset. There are many interesting questions in this joint data, and we are by no means limited to these concepts moving forward if we find something else novel, but it shows promise for meaningful and surprising results in our final submission.